enhancement(datadog encoder): support for metrics v3 protocol#1175
Binary Size Analysis (Agent Data Plane)
Target: bdcdc6c (baseline) vs 16afa2e (comparison) diff

| Module | File Size | Symbols |
|---|---|---|
| saluki_components::encoders::datadog | +73.02 KiB | 325 |
| core | +38.09 KiB | 9191 |
| anyhow | +9.05 KiB | 1302 |
| hashbrown | +7.45 KiB | 363 |
| saluki_common::task::instrument | +7.44 KiB | 70 |
| [sections] | +5.52 KiB | 9 |
| [Unmapped] | -4.81 KiB | 1 |
| protobuf | +4.74 KiB | 12 |
| saluki_components::common::datadog | -4.39 KiB | 319 |
| http | +3.85 KiB | 317 |
| agent_data_plane::cli::run | +3.11 KiB | 82 |
| saluki_io::compression::Compressor<W> | +2.49 KiB | 3 |
| saluki_io::net::util | +2.24 KiB | 130 |
| serde_core | +2.09 KiB | 351 |
| uuid | +1.85 KiB | 4 |
| agent_data_plane::components::apm_onboarding | -1.36 KiB | 34 |
| saluki_components::transforms::dogstatsd_mapper | -976 B | 18 |
| saluki_components::forwarders::datadog | +824 B | 22 |
| saluki_components::forwarders::otlp | -650 B | 60 |
| serde_json | +546 B | 175 |

Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +1.79Mi [NEW] +1.79Mi std::thread::local::LocalKey<T>::with::hc4ab4ecb1e791a86
+1.1% +147Ki +1.2% +131Ki [22490 Others]
[NEW] +122Ki [NEW] +122Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h87eff169ee6c6a27
[NEW] +67.3Ki [NEW] +67.2Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hc4516879514a2a08
[NEW] +62.1Ki [NEW] +62.0Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::ha5dcf4212e573f38
[NEW] +58.7Ki [NEW] +58.4Ki _<agent_data_plane::internal::control_plane::PrivilegedApiWorker as saluki_core::runtime::supervisor::Supervisable>::initialize::_{{closure}}::h30f35d64f65ba63d
[NEW] +49.6Ki [NEW] +49.4Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::h5053651f0bda4aaa
[NEW] +47.4Ki [NEW] +47.3Ki moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::h43bb4bf0b6f2bc50
[NEW] +46.4Ki [NEW] +46.3Ki h2::proto::connection::Connection<T,P,B>::poll::h14f55357aa1a69bb
[NEW] +46.3Ki [NEW] +46.1Ki _<saluki_components::destinations::prometheus::Prometheus as saluki_core::components::destinations::Destination>::run::_{{closure}}::hba93477fbb085aa3
[NEW] +45.6Ki [NEW] +45.4Ki _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::hfe9d1c75781916ba
[DEL] -46.2Ki [DEL] -46.0Ki _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::h9e6a07604104748f
[DEL] -46.2Ki [DEL] -46.0Ki _<saluki_components::destinations::prometheus::Prometheus as saluki_core::components::destinations::Destination>::run::_{{closure}}::h804b0075faebe88b
[DEL] -46.4Ki [DEL] -46.3Ki h2::proto::connection::Connection<T,P,B>::poll::hded4fb5f8a002638
[DEL] -47.5Ki [DEL] -47.3Ki moka::sync::base_cache::Inner<K,V,S>::do_run_pending_tasks::heca9fca4e5bc0d83
[DEL] -49.6Ki [DEL] -49.4Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::hd83d25f4c7e8a1ae
[DEL] -58.7Ki [DEL] -58.4Ki _<agent_data_plane::internal::control_plane::PrivilegedApiWorker as saluki_core::runtime::supervisor::Supervisable>::initialize::_{{closure}}::hf1e132bb3e8bfeaf
[DEL] -62.0Ki [DEL] -61.9Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hbff4cb15be99fca3
[DEL] -64.2Ki [DEL] -64.1Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::h7847418cb5d4cb46
[DEL] -119Ki [DEL] -119Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h5c132aa500e2a433
[DEL] -1.79Mi [DEL] -1.79Mi std::thread::local::LocalKey<T>::with::hddbf5b47575f6048
+0.6% +152Ki +0.6% +137Ki TOTAL
Regression Detector (Agent Data Plane)
Regression Detector Results
Run ID: f239cfbd-066c-4ca0-8dd3-4a7ebe106213 Baseline: bdcdc6c
Optimization Goals: ✅ No significant changes detected

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +1.40 | [-3.55, +6.34] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.02 | [-0.11, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_memory | memory utilization | -4.20 | [-4.38, -4.01] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | +2.22 | [+2.08, +2.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_cpu | % cpu utilization | +1.83 | [-0.71, +4.38] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +1.40 | [-3.55, +6.34] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | +0.84 | [-1.39, +3.07] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | +0.74 | [+0.55, +0.93] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_memory | memory utilization | +0.46 | [+0.21, +0.70] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | +0.41 | [+0.16, +0.67] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.37 | [+0.19, +0.54] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | +0.27 | [+0.08, +0.45] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | +0.25 | [+0.13, +0.38] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | +0.12 | [+0.08, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | +0.11 | [-0.06, +0.27] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | +0.03 | [-0.14, +0.20] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | +0.03 | [-0.09, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.02 | [-0.11, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | +0.01 | [-0.05, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.06, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.14, +0.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | -0.01 | [-0.04, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | -0.06 | [-0.23, +0.11] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | -0.10 | [-0.29, +0.09] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -0.15 | [-1.60, +1.30] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_memory | memory utilization | -0.28 | [-0.61, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | -0.41 | [-0.55, -0.28] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | -1.57 | [-1.79, -1.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | -2.41 | [-9.62, +4.79] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | -2.81 | [-8.98, +3.37] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_cpu | % cpu utilization | -2.91 | [-4.92, -0.91] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | -3.42 | [-59.97, +53.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_memory | memory utilization | -4.20 | [-4.38, -4.01] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | -4.36 | [-35.06, +26.33] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | -5.54 | [-58.31, +47.24] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | 114.78MiB ≤ 140MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | 34.25MiB ≤ 50MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | 54.66MiB ≤ 75MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | 169.28MiB ≤ 200MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | 21.48MiB ≤ 40MiB | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
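
The three criteria above can be sketched as a single predicate. This is only an illustration, not the detector's actual code: the `Experiment` struct, its field names, and `is_flagged_change` are hypothetical, and the `erratic` flag is assumed to come from each experiment's configuration.

```rust
/// One row of the tables above. Hypothetical type for illustration only.
struct Experiment {
    /// "Δ mean %" column.
    delta_mean_pct: f64,
    /// Lower and upper bounds of the "Δ mean % CI" column.
    ci_low: f64,
    ci_high: f64,
    /// Assumed to come from the experiment's configuration.
    erratic: bool,
}

/// Effect size tolerance from the report: |Δ mean %| ≥ 5.00%.
const EFFECT_SIZE_TOLERANCE_PCT: f64 = 5.0;

/// Returns true only when all three criteria hold.
fn is_flagged_change(exp: &Experiment) -> bool {
    // 1. The change is big enough to merit a closer look.
    let big_enough = exp.delta_mean_pct.abs() >= EFFECT_SIZE_TOLERANCE_PCT;
    // 2. The confidence interval does not contain zero.
    let ci_excludes_zero = exp.ci_low > 0.0 || exp.ci_high < 0.0;
    // 3. The experiment is not marked "erratic".
    big_enough && ci_excludes_zero && !exp.erratic
}
```

Under this reading, `dsd_uds_1mb_3k_contexts_cpu` (-5.54, CI [-58.31, +47.24]) is not flagged because its interval contains zero, and `otlp_ingest_logs_5mb_memory` (-4.20, CI [-4.38, -4.01]) is not flagged because it falls below the 5.00% tolerance.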
This is temporarily blocked until there is a version of the Datadog Agent with up-to-date v3 metrics support that we can test against in correctness tests. Currently, we're hitting an issue where rate intervals are delta encoded when they shouldn't be. That bug is fixed in DataDog/datadog-agent#45825, but the fix won't be released until 7.77, and it will be roughly 2 weeks before an RC is available to use. We can potentially do a hacky image build or something to keep going in the meantime and then switch back to a proper Agent version once available; we'll see.
79cdda1 to 59636cd
We've temporarily handled the issue of correctness tests by using a "dev" container image ( We can't merge this as-is: we need to wait for at least an RC build of Datadog Agent 7.77 so we can pin to a non-development image. In the meantime, I'm going to work on making sure we've integrated all of the same small fixes/changes that have steadily been made upstream in the Datadog Agent repository for V3 support.
30ee642 to 898021d
be9a81c to a9f5109
a9f5109 to 31b5f82
lib/saluki-components/src/encoders/datadog/metrics/v3/writer.rs
});
let v3_flushed = if let Some(v3_metrics) = maybe_v3_metrics {
    if v2_flushed || v3_metrics.len() >= v3_endpoint_config.max_metrics_per_payload() {
        encode_and_flush_v3_metrics(endpoint, &v3_endpoint_config, v3_metrics, &telemetry, &mut payloads_tx, batch_id.as_ref(), v3_payload_info).await?;
This doesn't seem to observe any intake payload size limits, or am I missing anything?
That's correct.
Right now, we flush either when the V2 encoder determines it needs to flush (so that we generate an equivalent payload in terms of the contained metrics between the two) or when we exceed the configured maximum metrics per payload limit.
In V2/V3 mode, I suppose it's entirely possible to have the V3 payload exceed the payload limits, although it would be incredibly unlikely. In V3 only mode, it's obviously a much more likely risk.
My thought process was that we would improve this -- make V3 encoding aware of the payload limits -- at the same time we added incremental compression to match the behavior of the Agent... since back when this was originally written many weeks ago, it seemed like we'd have enough time between then and "V3 only in production / for customers" to do the follow-up work.
I guess the question I have is: do you feel like we still have that sort of time before we want to be running V3 only?
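
To make the gap concrete, here is a minimal sketch of the flush decision as described above, with one extra size check bolted on. The first two conditions mirror the snippet under review; the third, the `V3Limits` struct, the `FlushReason` enum, and the idea of tracking an estimated encoded size are all assumptions for illustration and are not part of this PR or of any known intake limit.

```rust
/// Hypothetical limits. The byte limit is an assumed value supplied by the
/// caller, not a documented intake limit.
struct V3Limits {
    max_metrics_per_payload: usize,
    max_payload_bytes: usize,
}

#[derive(Debug, PartialEq)]
enum FlushReason {
    /// The V2 encoder flushed, so V3 flushes to keep payloads equivalent.
    V2Flushed,
    /// The buffered metric count reached the configured maximum.
    MetricCount,
    /// Assumed follow-up: the estimated encoded size reached the byte limit.
    SizeLimit,
}

/// Decides whether the buffered V3 metrics should be flushed now.
fn should_flush_v3(
    v2_flushed: bool,
    buffered_metrics: usize,
    estimated_bytes: usize,
    limits: &V3Limits,
) -> Option<FlushReason> {
    if v2_flushed {
        Some(FlushReason::V2Flushed)
    } else if buffered_metrics >= limits.max_metrics_per_payload {
        Some(FlushReason::MetricCount)
    } else if estimated_bytes >= limits.max_payload_bytes {
        Some(FlushReason::SizeLimit)
    } else {
        None
    }
}
```

A size estimate taken before compression would be conservative; matching the Agent's behavior more closely would presumably need the incremental compression mentioned above.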
Ok(encoded) => {
    match create_v3_request("/api/intake/metrics/v3/series", encoded, ep_config.compression_scheme()).await {
        Ok(request) => {
            flush_payload(request, events, payloads_tx, batch_id, 0, 1, payload_info).await?;
Is it intentional that batch_seq and batch_len are hard-coded as 0 and 1?
It is intentional, but only in the context of us not currently splitting V3 payloads: there can literally only ever be a single V3 payload in each batch.
(Mostly related to the question you left about not obeying intake payload size limits.)
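
If V3 payloads were split in the future, the two hard-coded arguments would presumably vary per payload. The sketch below assumes `batch_seq` is a zero-based index within the batch and `batch_len` is the total payload count, which is inferred only from the `0` and `1` in the call above; `BatchPosition` and `batch_positions` are hypothetical names.

```rust
/// Position of one payload within its batch. Hypothetical type; the field
/// meanings are inferred from the hard-coded `0, 1` arguments.
#[derive(Debug, PartialEq)]
struct BatchPosition {
    batch_seq: usize,
    batch_len: usize,
}

/// Assigns a position to each of `payload_count` payloads in a batch.
fn batch_positions(payload_count: usize) -> Vec<BatchPosition> {
    (0..payload_count)
        .map(|batch_seq| BatchPosition {
            batch_seq,
            batch_len: payload_count,
        })
        .collect()
}
```

With a single payload per batch, as is the case today, this yields exactly one position with `batch_seq` 0 and `batch_len` 1, matching the hard-coded values.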
ca5ebc6 to 16afa2e
