feat(trace sampler): implement error tracking standalone mode#1314

Open
thieman wants to merge 17 commits into main from thieman/error-tracking-standalone

Conversation

@thieman
Contributor

@thieman thieman commented Apr 6, 2026

Summary

Implements Error Tracking Standalone (ETS) sampling mode, matching datadog-agent/pkg/trace/agent/agent.go runSamplers.

Changes

  • transforms/trace_sampler/mod.rs — ETS check runs at the top of run_samplers before all other samplers. Traces containing errors (including exception span events) are routed to the error sampler; non-error traces are dropped immediately. Dropped non-error ETS traces are forwarded to intake with DroppedTrace=true (suppressing SSS and analytics events), matching Go agent behavior.
    • For OTLP traces when the probabilistic sampler is disabled, _dd.p.dm and _sampling_priority_v1 are pre-assigned inside the ETS block via otlp_pre_sample() — mirroring OTLPReceiver.createChunks in the Go agent. User-set priorities receive dm="-4" (manual); probabilistically-sampled traces receive dm="-9".
  • encoders/datadog/traces/mod.rs — emits _dd.error_tracking_standalone.error = "true" as a chunk tag (only for traces that actually contain an error span or exception span event) and sets X-Datadog-Error-Tracking-Standalone: true on outbound requests when ETS is enabled. The HTTP header is pre-built at construction time to avoid per-request allocation.
  • common/datadog/mod.rs — adds DECISION_MAKER_MANUAL constant ("-4") for user/manual sampling decisions.
  • common/datadog/apm.rs — ETS enabled via apm_config.error_tracking_standalone.enabled (YAML) or DD_APM_ERROR_TRACKING_STANDALONE_ENABLED (env var), following the same alias pattern as the rare sampler. Bool method docstrings standardized to "Returns true if ...".
  • common/datadog/request_builder.rs — adds additional_headers() hook to EndpointEncoder trait for encoder-specific request headers.
  • test/correctness/otlp-traces-ets/ — correctness test comparing ETS behavior against a DDA baseline, with error_rate: 0.1 to ensure a meaningful number of error traces pass through.
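The additional_headers() hook and the pre-built header described above could look roughly like the following. This is a minimal sketch, not the PR's code: the trait here is a hypothetical stand-in with a single method, TraceEncoder is an invented name, and headers are modeled as plain string pairs rather than real HTTP header types.

```rust
/// Hypothetical stand-in for the encoder trait; the real `EndpointEncoder` has other methods.
trait EndpointEncoder {
    /// Extra headers to attach to every outbound request. Defaults to none.
    fn additional_headers(&self) -> &[(&'static str, &'static str)] {
        &[]
    }
}

/// Hypothetical encoder that decides its extra headers once, at construction time.
struct TraceEncoder {
    // Built in `new`, so building a request only borrows this list.
    extra_headers: Vec<(&'static str, &'static str)>,
}

impl TraceEncoder {
    fn new(ets_enabled: bool) -> Self {
        let extra_headers = if ets_enabled {
            vec![("X-Datadog-Error-Tracking-Standalone", "true")]
        } else {
            Vec::new()
        };
        Self { extra_headers }
    }
}

impl EndpointEncoder for TraceEncoder {
    fn additional_headers(&self) -> &[(&'static str, &'static str)] {
        &self.extra_headers
    }
}
```

Keeping the default implementation empty means encoders that need no extra headers do not have to implement the hook at all.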

Behavioral notes

  • ETS takes priority over all other samplers (rare, probabilistic, priority)
  • Error detection includes both span.error != 0 and _dd.span_events.has_exception = "true"
  • Error traces: routed through error sampler (TPS-limited); if kept, tagged with ETS chunk tag and forwarded with ETS request header
  • Non-error traces: forwarded with DroppedTrace=true, no SSS or analytics event fallback; ETS chunk tag is not written (tag is only present on error traces)
  • OTLP pre-sampling (dm + priority) is computed inside the ETS block via a dedicated otlp_pre_sample() method, keeping the computation co-located with its only consumer
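As a rough illustration of the behavioral notes above, the ETS decision can be sketched as a small pure function. The Span struct, the EtsDecision enum, and both function names are hypothetical simplifications; only the error condition (span.error != 0 or _dd.span_events.has_exception = "true") is taken from the description.

```rust
/// Hypothetical, simplified span: only the fields the ETS check reads.
struct Span {
    error: i32,
    meta: Vec<(&'static str, &'static str)>,
}

#[derive(Debug, PartialEq)]
enum EtsDecision {
    /// ETS disabled: fall through to the rare, probabilistic, and priority samplers.
    NotApplicable,
    /// Error trace: the TPS-limited error sampler decides whether to keep it.
    ErrorSampler,
    /// Non-error trace: forwarded with DroppedTrace=true, no other sampler consulted.
    Dropped,
}

/// A span counts as an error if its error field is set or it carries an exception span event.
fn span_has_error(span: &Span) -> bool {
    span.error != 0
        || span
            .meta
            .iter()
            .any(|(k, v)| *k == "_dd.span_events.has_exception" && *v == "true")
}

/// ETS is checked first; when enabled, no other sampler sees the trace.
fn ets_decision(ets_enabled: bool, spans: &[Span]) -> EtsDecision {
    if !ets_enabled {
        return EtsDecision::NotApplicable;
    }
    if spans.iter().any(span_has_error) {
        EtsDecision::ErrorSampler
    } else {
        EtsDecision::Dropped
    }
}
```

The sketch stops at routing; whether an error trace is actually kept still depends on the error sampler's rate limit.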

Test plan

  • ets_keeps_trace_with_error — error trace kept by error sampler
  • ets_drops_trace_without_error — non-error trace dropped by run_samplers
  • ets_forwards_dropped_trace_with_dropped_flag — non-error ETS trace forwarded with DroppedTrace=true; SSS not applied
  • ets_keeps_trace_with_exception_span_event — exception span events count as errors
  • ets_disabled_uses_normal_sampling — ETS disabled falls through to normal sampling path
  • ets_otlp_non_error_gets_presample_priority_and_dm — non-error OTLP trace gets priority=AutoKeep, dm="-9" before ETS drop
  • ets_otlp_error_gets_presample_priority_and_dm — error OTLP trace gets priority=AutoKeep, dm="-9" when kept
  • ets_otlp_probabilistic_path_skips_presample — probabilistic sampler path: no pre-sampling applied
  • ets_non_otlp_unaffected_by_presample — non-OTLP traces unaffected by OTLP pre-sampling logic
  • ets_otlp_user_priority_gets_manual_dm — user-set priority gets dm="-4" (manual)
  • ets_header_present_when_enabled / ets_header_absent_when_disabled — HTTP header behavior
  • ets_chunk_tag_present_for_error_trace — chunk tag written for error traces
  • ets_chunk_tag_absent_for_non_error_trace — chunk tag not written for non-error traces
  • ets_chunk_tag_absent_when_disabled — chunk tag not written when ETS is disabled
  • Config tests: disabled by default, enabled via YAML, enabled via env var, env var overrides YAML
  • Correctness test (otlp-traces-ets): DDA vs ADP output matches with no differences detected

Stacked on #1311.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 6, 2026 17:21
@dd-octo-sts dd-octo-sts bot added area/core Core functionality, event model, etc. area/components Sources, transforms, and destinations. encoder/datadog-traces Datadog Traces encoder. transform/trace-sampler Trace Sampler synchronous transform. labels Apr 6, 2026
@pr-commenter

pr-commenter bot commented Apr 6, 2026

Binary Size Analysis (Agent Data Plane)

Target: bdcdc6c (baseline) vs b22e492 (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 26.47 MiB
Comparison Size: 26.47 MiB
Size Change: -3.78 KiB (-0.01%)
Pass/Fail Threshold: +5%
Result: PASSED ✅
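For readers checking the arithmetic, the reported percentage follows from the two sizes above. The sketch below assumes the gate is a simple "change must not exceed the threshold" comparison; the bot's actual logic is not shown in this report, and both function names are made up for illustration.

```rust
const KIB_PER_MIB: f64 = 1024.0;

/// Size change as a percentage of the baseline size.
fn size_change_percent(baseline_mib: f64, delta_kib: f64) -> f64 {
    delta_kib / (baseline_mib * KIB_PER_MIB) * 100.0
}

/// Assumed gate: pass when the change does not exceed the threshold.
fn passes_threshold(change_percent: f64, threshold_percent: f64) -> bool {
    change_percent <= threshold_percent
}
```

With a baseline of 26.47 MiB and a delta of -3.78 KiB, size_change_percent gives roughly -0.014, which the report rounds to -0.01%.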

Changes by Module

Module                                          File Size   Symbols
core                                            -4.17 KiB   46
[Unmapped]                                      -1.70 KiB   1
saluki_components::common::datadog              +694 B      22
saluki_components::encoders::datadog            +578 B      5
saluki_components::transforms::trace_sampler    +568 B      1
serde_core                                      -153 B      11
http                                            +148 B      2
agent_data_plane::cli::run                      +124 B      1
saluki_core::data_model::event                  +83 B       2
[sections]                                      +46 B       5
saluki_core::topology::shutdown                 +24 B       1
unicode_segmentation                            +16 B       1
tokio                                           +8 B        28
agent_data_plane::components::tag_filterlist    +4 B        1
figment                                         +4 B        1
saluki_common::task::instrument                 +0 B        2
anyhow                                          +0 B        17
tracing_core                                    +0 B        4

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +17.3Ki  [NEW] +17.2Ki    saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::he766a0eb48dc2ca7
  [NEW] +13.2Ki  [NEW] +13.0Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h521a21c05bafb1bd
  [NEW] +10.5Ki  [NEW] +10.3Ki    _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::h4137df811c47120b
  [NEW] +9.71Ki  [NEW] +9.60Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::hdd7d8c6f0bfa77f8
  [NEW] +5.71Ki  [NEW] +5.55Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::h9941a28ad84c7d85
  [NEW] +3.34Ki  [NEW] +3.17Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::h7112218fdd761b82
  [NEW] +2.62Ki  [NEW] +2.46Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush_inner::_{{closure}}::h671ea09455371842
  [NEW] +2.26Ki  [NEW] +2.12Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::create_request::h3dcd64c14c05724e
  [NEW] +1.84Ki  [NEW] +1.76Ki    tokio::runtime::task::raw::poll::h5f0300b45ba2987f
  [NEW] +1.78Ki  [NEW] +1.71Ki    tokio::runtime::task::raw::poll::h44baad23952b6723
  [DEL] -1.78Ki  [DEL] -1.71Ki    tokio::runtime::task::raw::poll::h99c86f35617a7f38
  [DEL] -1.84Ki  [DEL] -1.76Ki    tokio::runtime::task::raw::poll::h22bad6a42a6d9e4c
  -0.2% -2.18Ki  -0.1%    -718    [131 Others]
  [DEL] -2.62Ki  [DEL] -2.46Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush_inner::_{{closure}}::hc874e837eec984f6
  [DEL] -3.34Ki  [DEL] -3.18Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::try_split_request::_{{closure}}::hde2ee702a30f52ff
  [DEL] -3.93Ki  [DEL] -3.78Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h0d466dc6294304ee
  [DEL] -5.72Ki  [DEL] -5.57Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::flush::_{{closure}}::hdba5aa9ff3fada65
  [DEL] -9.63Ki  [DEL] -9.51Ki    saluki_components::common::datadog::apm::ApmConfig::from_configuration::hbd8a0046fecbdb97
  [DEL] -10.5Ki  [DEL] -10.3Ki    _<saluki_common::task::instrument::InstrumentedTask<F> as core::future::future::Future>::poll::hac79458d260639d8
  [DEL] -13.2Ki  [DEL] -13.1Ki    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h1199aefcd5f2ab2d
  [DEL] -17.3Ki  [DEL] -17.1Ki    saluki_components::encoders::datadog::traces::TraceEndpointEncoder::encode_tracer_payload::_{{closure}}::_{{closure}}::hfafcbd98d1699351
  -0.0% -3.78Ki  -0.0% -2.29Ki    TOTAL

Copilot AI left a comment

Pull request overview

This pull request implements Error Tracking Standalone (ETS) sampling mode for the trace sampler, matching the behavior of datadog-agent/pkg/trace/agent/agent.go runSamplers. ETS is a high-priority sampling mode that:

  • Keeps all error traces by routing them through the error sampler (TPS-limited)
  • Immediately drops all non-error traces without consulting other samplers
  • Suppresses single-span sampling and analytics events for dropped traces
  • Tags kept error traces with _dd.error_tracking_standalone.error = "true" chunk tag

Changes:

  • Added ets_error: bool field to TraceSampling to track whether a trace was kept by ETS mode
  • Added sampling_mut() accessor to Trace for mutable access to sampling metadata
  • Integrated ETS as the highest-priority sampler in run_samplers that runs before all other samplers
  • Updated encoder to emit the ETS error chunk tag when ets_error is set
  • Added comprehensive test coverage for all ETS behavior scenarios
  • ETS is disabled by default and can be enabled via configuration

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File                                                         Description
lib/saluki-core/src/data_model/event/trace/mod.rs            Added ets_error field to TraceSampling and sampling_mut() accessor to Trace
lib/saluki-components/src/transforms/trace_sampler/mod.rs    Implemented ETS logic at the top of run_samplers with comprehensive test coverage
lib/saluki-components/src/encoders/datadog/traces/mod.rs     Emits _dd.error_tracking_standalone.error chunk tag for kept ETS traces

(
    "apm_config.error_tracking_standalone.enabled",
    "apm_error_tracking_standalone",
),
Contributor Author

@pr-commenter

pr-commenter bot commented Apr 6, 2026

Regression Detector (Agent Data Plane)

Regression Detector Results

Run ID: df7f51dd-b5d2-4e7b-87ed-842b4013705a

Baseline: bdcdc6c
Comparison: b22e492
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
otlp_ingest_logs_5mb_cpu % cpu utilization +0.63 [-4.30, +5.55] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_throughput ingress throughput -0.02 [-0.15, +0.11] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_memory memory utilization -8.05 [-8.53, -7.57] 1 (metrics) (profiles) (logs)

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
dsd_uds_500mb_3k_contexts_cpu % cpu utilization +2.17 [+0.82, +3.53] 1 (metrics) (profiles) (logs)
dsd_uds_512kb_3k_contexts_cpu % cpu utilization +1.74 [-56.17, +59.65] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_filtering_5mb_cpu % cpu utilization +1.03 [-1.45, +3.51] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_cpu % cpu utilization +0.99 [-1.22, +3.21] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_cpu % cpu utilization +0.63 [-4.30, +5.55] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_transform_5mb_memory memory utilization +0.59 [+0.33, +0.84] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_medium memory utilization +0.44 [+0.24, +0.63] 1 (metrics) (profiles) (logs)
dsd_uds_500mb_3k_contexts_throughput ingress throughput +0.20 [+0.07, +0.32] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_filtering_5mb_memory memory utilization +0.17 [-0.16, +0.50] 1 (metrics) (profiles) (logs)
dsd_uds_512kb_3k_contexts_memory memory utilization +0.16 [-0.01, +0.33] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_ultraheavy memory utilization +0.16 [+0.03, +0.28] 1 (metrics) (profiles) (logs)
dsd_uds_500mb_3k_contexts_memory memory utilization +0.15 [-0.01, +0.31] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_memory memory utilization +0.09 [-0.08, +0.26] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_memory memory utilization +0.07 [-0.11, +0.25] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_cpu % cpu utilization +0.04 [-54.81, +54.88] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_throughput ingress throughput +0.02 [-0.13, +0.17] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_throughput ingress throughput +0.01 [-0.12, +0.14] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_throughput ingress throughput +0.00 [-0.02, +0.02] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_transform_5mb_throughput ingress throughput +0.00 [-0.02, +0.02] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_filtering_5mb_throughput ingress throughput -0.00 [-0.02, +0.02] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_throughput ingress throughput -0.00 [-0.03, +0.03] 1 (metrics) (profiles) (logs)
dsd_uds_1mb_3k_contexts_throughput ingress throughput -0.00 [-0.06, +0.06] 1 (metrics) (profiles) (logs)
dsd_uds_512kb_3k_contexts_throughput ingress throughput -0.01 [-0.06, +0.05] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_throughput ingress throughput -0.02 [-0.15, +0.11] 1 (metrics) (profiles) (logs)
dsd_uds_100mb_3k_contexts_cpu % cpu utilization -0.09 [-6.43, +6.25] 1 (metrics) (profiles) (logs)
quality_gates_rss_idle memory utilization -0.10 [-0.13, -0.08] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_heavy memory utilization -0.15 [-0.28, -0.01] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_memory memory utilization -0.16 [-0.34, +0.02] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_5mb_memory memory utilization -0.19 [-0.44, +0.06] 1 (metrics) (profiles) (logs)
quality_gates_rss_dsd_low memory utilization -0.25 [-0.44, -0.06] 1 (metrics) (profiles) (logs)
dsd_uds_10mb_3k_contexts_cpu % cpu utilization -0.45 [-31.98, +31.07] 1 (metrics) (profiles) (logs)
otlp_ingest_traces_ottl_transform_5mb_cpu % cpu utilization -0.77 [-2.95, +1.41] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_cpu % cpu utilization -4.23 [-11.58, +3.11] 1 (metrics) (profiles) (logs)
otlp_ingest_metrics_5mb_memory memory utilization -4.81 [-5.00, -4.63] 1 (metrics) (profiles) (logs)
otlp_ingest_logs_5mb_memory memory utilization -8.05 [-8.53, -7.57] 1 (metrics) (profiles) (logs)

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed observed_value links
quality_gates_rss_dsd_heavy memory_usage 10/10 114.75MiB ≤ 140MiB (metrics) (profiles) (logs)
quality_gates_rss_dsd_low memory_usage 10/10 34.47MiB ≤ 50MiB (metrics) (profiles) (logs)
quality_gates_rss_dsd_medium memory_usage 10/10 54.19MiB ≤ 75MiB (metrics) (profiles) (logs)
quality_gates_rss_dsd_ultraheavy memory_usage 10/10 169.07MiB ≤ 200MiB (metrics) (profiles) (logs)
quality_gates_rss_idle memory_usage 10/10 21.28MiB ≤ 40MiB (metrics) (profiles) (logs)

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
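The three criteria above translate directly into a small predicate. This is a sketch that follows the text literally; the Experiment struct and the is_regression name are hypothetical, and the detector's real implementation is not shown here.

```rust
/// Hypothetical summary of one experiment row from the tables above.
struct Experiment {
    delta_mean_pct: f64,
    ci_low: f64,
    ci_high: f64,
    erratic: bool,
}

/// Flags an experiment only when all three stated criteria hold.
fn is_regression(e: &Experiment, tolerance_pct: f64) -> bool {
    // 1. The estimated effect is large enough to merit a closer look.
    let big_enough = e.delta_mean_pct.abs() >= tolerance_pct;
    // 2. The confidence interval does not contain zero.
    let ci_excludes_zero = e.ci_low > 0.0 || e.ci_high < 0.0;
    // 3. The experiment is not marked erratic.
    big_enough && ci_excludes_zero && !e.erratic
}
```

For example, otlp_ingest_metrics_5mb_memory (-4.81, CI [-5.00, -4.63]) has an interval that excludes zero but falls short of the 5.00% tolerance, so it is not flagged.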

Copilot AI review requested due to automatic review settings April 6, 2026 17:41
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



@dd-octo-sts dd-octo-sts bot removed the area/core Core functionality, event model, etc. label Apr 6, 2026
Copilot AI review requested due to automatic review settings April 6, 2026 17:59
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



@thieman
Contributor Author

thieman commented Apr 6, 2026

[Claude Sonnet 4.6] Related to #1134.

}

// ETS: suppress single span sampling and analytics events for dropped traces.
// logic taken from: https://github.com/DataDog/datadog-agent/blob/be33ac1490c4a34602cbc65a211406b73ad6d00b/pkg/trace/agent/agent.go#L976
Contributor Author

@thieman thieman Apr 6, 2026

[Claude Sonnet 4.6] Note: the if self.error_tracking_standalone { return false; } immediately below was a pre-existing stub from main — the field was wired to config and this SSS/analytics suppression guard was in place, but the core ETS sampling logic in run_samplers was missing. This PR completes the implementation.

@thieman thieman marked this pull request as ready for review April 6, 2026 19:03
@thieman thieman requested a review from a team as a code owner April 6, 2026 19:03
Copilot AI review requested due to automatic review settings April 6, 2026 19:03
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@thieman thieman force-pushed the thieman/rare-sampler branch from efd14f6 to af78b1a Compare April 6, 2026 20:44
/// can be remapped via ConfigurationLoader::with_key_aliases.
#[serde(
    default = "default_error_tracking_standalone_enabled",
    rename = "apm_error_tracking_standalone"
Contributor

Should this be rename = "apm_error_tracking_standalone_enabled"?

agent config

Contributor Author

Uuuuuuugh well I got this from our public docs so I'll go see what actually works https://docs.datadoghq.com/error_tracking/backend/getting_started/single_step_instrumentation/?tab=linuxhostorvm

Contributor Author

Ok from this code search looks like it does need to have _enabled at the end, will fix https://github.com/search?q=repo%3ADataDog%2Fdatadog-agent%20DD_APM_ERROR_TRACKING_STANDALONE&type=code

I see my mistake now, the docs I linked are for the installer, not for actually running the agent

Contributor Author

[Claude Sonnet 4.6] Fixed — changed rename to apm_error_tracking_standalone_enabled and updated the key alias in config.rs and env var reference in tests accordingly.

@thieman
Contributor Author

thieman commented Apr 8, 2026

[Claude Sonnet 4.6] Flagging a concern from review of the rare sampler PR (#1311) that may still apply once both PRs are merged — worth verifying.

In the Go agent (agent.go:1069-1077), the ErrorTrackingStandalone check is the first thing in runSamplers, before the rare sampler runs. Non-error traces are immediately dropped and the rare sampler never sees them.

In the current Rust implementation, run_samplers calls rare_sampler.sample() unconditionally before the ETS guard. The ETS check only appears in process_trace on the keep=false path — so a non-error trace caught by the rare sampler (keep=true) bypasses the ETS guard entirely and would be kept when Go would drop it. The rare sampler also wastefully consumes rate-limiter tokens on traces ETS will ultimately discard.

The fix would be to move the ETS guard to the top of run_samplers, before the rare sampler call.

Implements the ETS sampling path from datadog-agent/pkg/trace/agent/agent.go:

- In run_samplers: ETS check runs at the very top (before rare/probabilistic).
  Traces with errors (including exception span events) are routed exclusively
  to the error sampler; traces without errors are dropped immediately.
- In process_trace: dropped ETS traces suppress SSS and analytics events.
- When ETS keeps a trace, sets ets_error=true on TraceSampling so the encoder
  emits _dd.error_tracking_standalone.error="true" as a chunk tag.
- Adds ets_error field to TraceSampling and sampling_mut() accessor to Trace.
- Updates the DD traces encoder to emit the ETS chunk tag when set.

6 new tests: error kept, no-error dropped, SSS suppressed, ets_error flag set,
exception span events treated as errors, ETS disabled uses normal path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thieman and others added 8 commits April 9, 2026 09:57
Follows the same pattern as enable_rare_sampler:
- YAML: apm_config.error_tracking_standalone.enabled
- Env var: DD_APM_ERROR_TRACKING_STANDALONE

Adds enable_error_tracking_standalone to the ApmConfiguration wrapper
(with rename = "apm_error_tracking_standalone"), a KEY_ALIAS mapping the
nested YAML path to the flat key, and copies the value into ApmConfig in
from_configuration. Removes the now-redundant ErrorTrackingStandaloneConfig
struct. Adds 4 config tests mirroring the rare sampler config tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… per-request allocation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hunk tag from config directly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… in source comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, keep only ETS-added links as permalinks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to avoid per-trace method call

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one_enabled

Fixes rename and key alias to match the actual Datadog Agent config key.
Previous key was missing the `_enabled` suffix per
pkg/config/setup/apm.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman thieman force-pushed the thieman/error-tracking-standalone branch from 99425a8 to 2d5cb62 Compare April 9, 2026 14:01
@thieman thieman changed the base branch from thieman/rare-sampler to main April 9, 2026 14:01
@thieman
Contributor Author

thieman commented Apr 9, 2026

[Claude Sonnet 4.6] Verified — this is addressed in the current implementation. The ETS guard is at the top of run_samplers (before rare_sampler.sample()), so when ETS is enabled the function returns early on the first trace span check, and the rare sampler never runs. Non-error traces are dropped before any other sampler is consulted, matching the Go agent ordering.

thieman and others added 2 commits April 9, 2026 16:30
… var

APM_ERROR_TRACKING_STANDALONE → APM_ERROR_TRACKING_STANDALONE_ENABLED
to match the corrected config key rename.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…; gate chunk tag on error presence

Non-error traces in ETS mode are now forwarded to intake with DroppedTrace=true instead of
being silently discarded, matching Go agent behavior. The ETS chunk tag is now only written
when the trace actually contains an error span or exception span event.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman
Contributor Author

thieman commented Apr 9, 2026

[Claude Sonnet 4.6]

ETS dropped trace forwarding

This PR changes how non-error traces are handled in ETS mode. Previously they were silently discarded. They're now forwarded to intake with DroppedTrace=true on the chunk metadata.

How the Go agent does it

In the Go agent, this is achieved as a side effect of the SSS (Single Span Sampling) guard. The relevant code in pkg/trace/agent/agent.go:

if !keep && !a.conf.ErrorTrackingStandalone {
    // Single span sampling: keep individually-marked spans from dropped traces
    ss := a.singleSpanSampler
    if ss != nil && ss.enabled() {
        ss.applySpanSamplingRules(pt, chunks)
    }
    // Analytics events fallback
    ...
}

The !a.conf.ErrorTrackingStandalone guard skips span-stripping for ETS. As a result, when ETS is enabled and a trace is dropped (keep=false), the spans remain in the chunk. Later in appendChunks, chunks are only removed if !keep && len(spans)==0. Since ETS traces still have their spans, they pass through and are forwarded with DroppedTrace=true in the protobuf.

The Rust implementation now explicitly replicates this: when ETS is enabled and the trace is dropped, apply_sampling_metadata is called with keep=false (which sets dropped_trace=true on the TraceSampling struct), and the trace is forwarded rather than discarded.
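Under the reading described above (which this comment asks reviewers to verify), the forwarding rule can be sketched as follows. The Chunk struct, the finish_chunk function, and the spans_after_strip parameter are hypothetical; the sketch models only the stripping guard and the removal rule as described here, not the Go agent's or this PR's actual code.

```rust
/// Hypothetical, simplified chunk: a span count plus the DroppedTrace flag.
#[derive(Debug, PartialEq)]
struct Chunk {
    span_count: usize,
    dropped_trace: bool,
}

/// Returns the chunk to forward, or `None` if it is removed.
/// `spans_after_strip` is how many spans SSS/analytics would leave on a dropped trace.
fn finish_chunk(
    span_count: usize,
    keep: bool,
    ets_enabled: bool,
    spans_after_strip: usize,
) -> Option<Chunk> {
    let mut remaining = span_count;
    // The stripping step is skipped when ETS is enabled, so spans stay in the chunk.
    if !keep && !ets_enabled {
        remaining = spans_after_strip;
    }
    // Chunks are only removed when the trace is dropped and no spans remain.
    if !keep && remaining == 0 {
        return None;
    }
    // In this sketch, DroppedTrace simply mirrors the negated keep decision.
    Some(Chunk { span_count: remaining, dropped_trace: !keep })
}
```

A dropped three-span trace with ETS enabled comes out as Some(Chunk { span_count: 3, dropped_trace: true }), while the same trace with ETS disabled and nothing retained by stripping is removed.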

Please verify: Is this the correct read of the Go behavior? Specifically — is DroppedTrace=true actually set on these chunks in Go, and is the intent that dropped ETS traces are forwarded to the backend for stats/analytics purposes?

@thieman
Contributor Author

thieman commented Apr 9, 2026

@andrewqian2001datadog ready for another review here. I had some specific questions based on the Claude comment immediately above with the "ETS dropped trace forwarding" header. It seems like DDA currently forwards all traces (with DroppedTrace=true) when ETS is enabled, and I wanted to see if that passes the sniff test for you.

Comment on lines +95 to +97
/// Enables Error Tracking Standalone mode. Lives here (rather than nested within `apm_config`)
/// so that the env var path (`DD_APM_ERROR_TRACKING_STANDALONE_ENABLED` → `apm_error_tracking_standalone_enabled`)
/// can be remapped via ConfigurationLoader::with_key_aliases.
Member

I missed that we also did this pattern with the rare sampler enabled field... falls right into that very narrow space between what key aliases and env remappings give us. 😭
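To make the pattern concrete, here is a toy model of how a nested YAML key and an env var can both land on the same flat key. Everything in it is an assumption for illustration: it does not use ConfigurationLoader::with_key_aliases, the env_to_key mapping (strip DD_, lowercase) is a guess at the convention, and resolve_ets_enabled is a made-up helper. Only the key names, the default of false, and "env var overrides YAML" come from this PR.

```rust
use std::collections::HashMap;

/// Alias from the nested YAML path to the flat key the wrapper struct deserializes.
const KEY_ALIASES: &[(&str, &str)] = &[(
    "apm_config.error_tracking_standalone.enabled",
    "apm_error_tracking_standalone_enabled",
)];

/// Assumed env var convention: drop the `DD_` prefix and lowercase the rest.
fn env_to_key(name: &str) -> Option<String> {
    name.strip_prefix("DD_").map(|rest| rest.to_ascii_lowercase())
}

/// Hypothetical resolver: YAML first, then env vars on top, default false.
fn resolve_ets_enabled(yaml: &HashMap<&str, bool>, env: &HashMap<&str, &str>) -> bool {
    let mut flat: HashMap<String, bool> = HashMap::new();
    // Copy aliased YAML values onto their flat keys.
    for (nested, alias) in KEY_ALIASES {
        if let Some(value) = yaml.get(nested) {
            flat.insert(alias.to_string(), *value);
        }
    }
    // Env vars are applied last, so they override YAML.
    for (name, raw) in env {
        if let (Some(key), Ok(value)) = (env_to_key(name), raw.parse::<bool>()) {
            flat.insert(key, value);
        }
    }
    flat.get("apm_error_tracking_standalone_enabled")
        .copied()
        .unwrap_or(false)
}
```

Because both sources end up on one flat key, a single boolean field can serve the YAML path and the env var without a nested config struct.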

  /// Returns if error tracking standalone mode is enabled.
  pub const fn error_tracking_standalone_enabled(&self) -> bool {
-     self.error_tracking_standalone.enabled
+     self.error_tracking_standalone
Member

Can we actually update the doc comment for this method to say:

Returns true if error tracking standalone mode is enabled.

Contributor Author

It's pattern matching here, I'll have it update this for the other bool-returning methods here as well

Comment on lines +468 to +473
+ // ETS: forward dropped traces with DroppedTrace=true, suppressing SSS/analytics.
  if self.error_tracking_standalone {
-     return false;
+     if let Some(root_idx) = root_span_idx {
+         self.apply_sampling_metadata(trace, false, priority, decision_maker, root_idx);
+     }
+     return true;
Member

My brain is a little mushy trying to think this one through...

On the Agent side, this method is equivalent to sample (here), where the boolean return value is whether or not to keep the trace.

Above this line, we check if keep is true and then return true if so... so if we're here, keep is false. Nowhere in sample is keep mutated after the call to a.traceSampling(now, ts, pt), so why do we return true even though we know keep is false? 🤔

Contributor Author

If we return false here then the Trace is removed from the buffer and ultimately dropped. In ETS mode we need to forward all traces, and non-error traces get DroppedTrace=true metadata on them. Verified this behavior with the new correctness test added in this PR.

…cking Standalone mode

Adds a correctness test that sends OTLP traces to both the baseline (DDA) and
comparison (DDA+ADP) agents with ETS enabled, verifying that both forward the
same set of spans (error traces kept, non-error traces forwarded with DroppedTrace=true).

Uses a 10% error rate in the millstone corpus for meaningful error trace coverage,
and disables TPS limits to prevent the error sampler rate from being a variable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts dd-octo-sts bot added area/ci CI/CD, automated testing, etc. area/test All things testing: unit/integration, correctness, SMP regression, etc. labels Apr 10, 2026
Mirror DDA's OTLPReceiver.createChunks behavior: when the probabilistic
sampler is disabled, assign dm/priority based on trace ID sampling before
the ETS early return so non-error OTLP traces still carry the correct
`_dd.p.dm` and `_sampling_priority_v1` values.

Add DECISION_MAKER_MANUAL constant (-4) to common/datadog for user-set
sampling decisions, and unit test all five OTLP pre-sampling ETS paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
thieman and others added 4 commits April 10, 2026 10:54
…e` if"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirror DDA more accurately: OTLPReceiver.createChunks runs before
runSamplersV1 entirely, so the dm/priority pre-assignment is not
inside the ETS branch. Move the otlp_pre_sample computation above
the ETS check; ETS consumes the result unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Moves the OTLP pre-sampling logic into a dedicated method and calls it
from inside the ETS block, keeping the computation co-located with its
only consumer and eliminating wasted work when ETS is disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/// or `None` if pre-sampling does not apply.
///
/// See: https://github.com/DataDog/datadog-agent/blob/be33ac1490c4a34602cbc65a211406b73ad6d00b/pkg/trace/api/otlp.go#L561-L585
fn otlp_pre_sample(&mut self, trace: &mut Trace, root_span_idx: usize) -> Option<(i32, &'static str)> {
Contributor Author

This was a missing piece, in DDA some metadata is applied on incoming OTLP traces before the samplers are run. The ETS branch in runSamplers needs that metadata. In ADP, that same metadata is applied after run_samplers is called, so we don't have it available here. This block allows us to source the OTLP-specific information we need without changing up the order of sampling vs applying metadata, which would cause us to waste cycles on otherwise-dropped traces.
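A standalone sketch of the pre-sampling decision may help here. It is not the PR's otlp_pre_sample method: the free function, its parameters, the constant names other than DECISION_MAKER_MANUAL, and the sample_by_rate helper are hypothetical. The helper uses a multiplicative hash of the trace ID as an assumed stand-in for trace-ID-based sampling, and attaching dm "-9" to rate-dropped traces is also an assumption.

```rust
const PRIORITY_AUTO_DROP: i32 = 0;
const PRIORITY_AUTO_KEEP: i32 = 1;
/// Decision maker for user/manual sampling decisions (from this PR).
const DECISION_MAKER_MANUAL: &str = "-4";
/// Decision maker for probabilistic decisions (hypothetical constant name).
const DECISION_MAKER_PROBABILISTIC: &str = "-9";

/// Assumed trace-ID sampling: a multiplicative hash compared against the rate.
fn sample_by_rate(trace_id: u64, rate: f64) -> bool {
    if rate >= 1.0 {
        return true;
    }
    let hashed = trace_id.wrapping_mul(1_111_111_111_111_111_111);
    (hashed as f64) < rate * (u64::MAX as f64)
}

/// Returns `(priority, decision_maker)` or `None` if pre-sampling does not apply.
fn otlp_pre_sample_sketch(
    is_otlp: bool,
    probabilistic_sampler_enabled: bool,
    user_priority: Option<i32>,
    trace_id: u64,
    rate: f64,
) -> Option<(i32, &'static str)> {
    // Only OTLP traces are pre-sampled, and only when the probabilistic sampler is off.
    if !is_otlp || probabilistic_sampler_enabled {
        return None;
    }
    // A user-set priority is passed through and marked as a manual decision.
    if let Some(priority) = user_priority {
        return Some((priority, DECISION_MAKER_MANUAL));
    }
    // Otherwise decide from the trace ID.
    let priority = if sample_by_rate(trace_id, rate) {
        PRIORITY_AUTO_KEEP
    } else {
        PRIORITY_AUTO_DROP
    };
    Some((priority, DECISION_MAKER_PROBABILISTIC))
}
```

Returning an Option keeps the "does not apply" cases (non-OTLP traces, probabilistic sampler enabled) distinct from an explicit drop decision.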


Labels

area/ci CI/CD, automated testing, etc. area/components Sources, transforms, and destinations. area/test All things testing: unit/integration, correctness, SMP regression, etc. encoder/datadog-traces Datadog Traces encoder. transform/trace-sampler Trace Sampler synchronous transform.

4 participants