fix(datadog encoder): remove false-positive compression estimator check that panics on first write (#1310)
Conversation
CompressionEstimator::would_write_exceed_threshold had an early exit:
if len > threshold { return true; }
When a pathological metric (e.g. a sketch with many bins from a very
low sample rate) was the very first input encoded, its uncompressed
encoded length could exceed the configured compressed size limit. The
early exit fired before any compressed data had been written, signalling
encode_inner to flush — but with an empty builder, flush() returned
nothing, hitting the panic:
"builder told us to flush, but gave us nothing"
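The failure path above can be sketched in miniature. This is a hypothetical, simplified model for illustration only — `flush` and `drive` are not the real saluki APIs — showing how a flush signal with an empty builder produces exactly the reported error:

```rust
// Simplified model: flush() yields no payloads when nothing was buffered.
fn flush(buffered: &[u8]) -> Vec<Vec<u8>> {
    if buffered.is_empty() {
        Vec::new()
    } else {
        vec![buffered.to_vec()]
    }
}

// Simplified model of the caller: if the encoder signalled a flush but the
// builder had buffered nothing, the caller hit the panic message quoted above.
fn drive(builder_empty: bool, flush_requested: bool) -> Result<(), &'static str> {
    if flush_requested {
        let buffered: &[u8] = if builder_empty { &[] } else { b"data" };
        let payloads = flush(buffered);
        if payloads.is_empty() {
            // This is where the reported panic fired before the fix.
            return Err("builder told us to flush, but gave us nothing");
        }
    }
    Ok(())
}
```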
The early exit is wrong: uncompressed size > compressed limit says
nothing about whether the compressed output will exceed the limit (many
sketches with near-identical bins compress very well). The estimator's
job is to act as an optimization hint, not a hard gate that can fire
before the first byte has been compressed.
Removing the check means the existing total_compressed_len == 0 guard
is the first thing evaluated, returning false and allowing the write to
proceed. A genuinely oversized payload will be caught during flush via
the existing PayloadTooLarge path instead of panicking.
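As a rough illustration of the fixed check described above — field names, the estimation formula, and the struct layout here are assumptions for the sketch, not the actual saluki implementation — the `total_compressed_len == 0` guard comes first, so nothing can trip the threshold before the first byte has been compressed:

```rust
// Hypothetical, simplified estimator; not the real saluki type.
struct CompressionEstimator {
    total_compressed_len: usize,
    in_flight_uncompressed_len: usize,
    estimated_ratio: f64, // assumed: estimated compressed/uncompressed ratio
}

impl CompressionEstimator {
    // Fixed version: no `len > threshold` early exit. With no compressed
    // output yet we have no basis for an estimate, so allow the write and
    // rely on the flush-time PayloadTooLarge check as the backstop.
    fn would_write_exceed_threshold(&self, len: usize, threshold: usize) -> bool {
        if self.total_compressed_len == 0 {
            return false;
        }
        let estimated = self.total_compressed_len as f64
            + (self.in_flight_uncompressed_len + len) as f64 * self.estimated_ratio;
        estimated as usize > threshold
    }
}
```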
Fixes the panic reported in #1119 / discussed in #1305.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Binary Size Analysis (Agent Data Plane)

Target: b7aedc3 (baseline) vs 4651645 (comparison) diff

| Module | File Size | Symbols |
|---|---|---|
| [Unmapped] | +188 B | 1 |
| saluki_components::common::datadog | -174 B | 6 |
| [sections] | -14 B | 2 |
Detailed Symbol Changes
```
 FILE SIZE        VM SIZE
 --------------   --------------
  +4.1%   +188    [ = ]      0    [Unmapped]
  +0.0%     +4    +0.0%     +4    [section .gcc_except_table]
  -0.1%    -14    -0.1%    -14    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h9e42713475fa50b4
  -0.4%    -16    -0.4%    -16    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h6e9889bd0f773199
  -0.0%    -18    -0.0%    -18    [section .text]
  -0.3%    -27    -0.3%    -27    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::hf5a628a3a0b06c9a
  -0.2%    -30    -0.2%    -30    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h9eedd2bd0e10ef6a
  -0.2%    -30    -0.2%    -30    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::hcd0b9ca9428d1499
  -0.9%    -57    -0.9%    -57    saluki_components::common::datadog::request_builder::RequestBuilder<E>::encode_inner::_{{closure}}::h329de940b9ccc923
  [ = ]     0     -0.0%   -188    TOTAL
```
Regression Detector (Agent Data Plane)

Regression Detector Results

Run ID: d2fe7750-b3bd-4c4b-8fb3-30552e6fbedc
Baseline: 592597c
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +7.43 | [+6.89, +7.96] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +0.44 | [-4.71, +5.58] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | -0.02 | [-0.14, +0.10] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | +21.05 | [-40.86, +82.97] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | +14.40 | [-43.18, +71.98] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | +11.71 | [-18.41, +41.84] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | +7.44 | [-1.29, +16.18] | 1 | (metrics) (profiles) (logs) |
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +7.43 | [+6.89, +7.96] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | +2.79 | [+0.77, +4.80] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | +0.72 | [+0.59, +0.85] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | +0.47 | [-0.95, +1.90] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | +0.44 | [-4.71, +5.58] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | +0.43 | [+0.18, +0.68] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | +0.41 | [+0.21, +0.60] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.38 | [+0.20, +0.57] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | +0.30 | [+0.26, +0.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | +0.18 | [-0.01, +0.36] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | +0.15 | [-0.05, +0.34] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | +0.13 | [-0.04, +0.31] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_memory | memory utilization | +0.07 | [-0.27, +0.41] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | +0.05 | [-0.08, +0.18] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | +0.03 | [-0.11, +0.17] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | +0.01 | [-0.02, +0.04] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.05, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.05, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_memory | memory utilization | +0.00 | [-0.24, +0.24] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_throughput | ingress throughput | -0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.14, +0.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_throughput | ingress throughput | -0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | -0.01 | [-0.14, +0.12] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | -0.02 | [-0.14, +0.10] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | -0.26 | [-0.45, -0.08] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | -0.37 | [-0.53, -0.20] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_cpu | % cpu utilization | -0.92 | [-3.28, +1.44] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_cpu | % cpu utilization | -1.94 | [-4.11, +0.24] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | -2.19 | [-2.35, -2.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | -3.55 | [-9.28, +2.19] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | 111.74MiB ≤ 140MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | 34.25MiB ≤ 50MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | 53.11MiB ≤ 75MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | 169.16MiB ≤ 200MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | 20.90MiB ≤ 40MiB | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we treat a change in performance as a "regression" -- a change worth investigating further -- only if all of the following criteria are true:

- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
Merged c8eb887 into main
fix(datadog encoder): remove false-positive compression estimator check that panics on first write (#1310)

## Problem

When a pathological metric (e.g. a sketch with many bins generated by an extremely low sample rate) was the very first input encoded into a request builder, the agent would panic:

```
builder told us to flush, but gave us nothing
```

**Root cause**: `CompressionEstimator::would_write_exceed_threshold` had an early exit:

```rust
if len > threshold {
    return true;
}
```

If the metric's uncompressed encoded size exceeded the configured *compressed* size limit, this guard fired and returned `true` — signalling that a flush was needed — even though nothing had been written to the compressor yet. `encode_inner` returned `Ok(false)` (flush signal), `flush()` returned an empty vec, and the caller panicked.

The check is logically wrong: uncompressed size being larger than the compressed limit says nothing about whether the *compressed* output will fit. Sketches with thousands of near-identical bins (the exact pathological case) compress extremely well. The estimator's purpose is to be an optimization hint, not a hard gate.

## Fix

Remove the early `len > threshold` exit from `would_write_exceed_threshold`. The existing `total_compressed_len == 0` guard (which already returns `false` when no data has been written) is sufficient — and is now the first thing evaluated.

If the payload genuinely is too large after compression, the backstop is [`flush()` at the `compressed_len > compressed_limit` check](https://github.com/DataDog/saluki/blob/cf17a0cd17/lib/saluki-components/src/common/datadog/request_builder.rs#L399): for a single oversized metric it clears the input and returns `Err(PayloadTooLarge)`, which the encoder caller logs and drops rather than panicking.

This matches the approach @tobz suggested in #1305.

## Test plan

- [x] `first_input_larger_than_compressed_limit_does_not_panic` — creates a builder with a compressed limit (1 KB) smaller than the input (10 KB of repeated chars), encodes the large input as the very first metric, and asserts no panic
- [x] Updated `compression_estimator_no_output` unit test — now asserts `false` for both cases when no compressed data has been written, reflecting the new semantics
- [x] All existing request builder and compression estimator tests pass

Closes #1119. See also #1305.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: travis.thieman <travis.thieman@datadoghq.com>
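The flush-time backstop described above can be sketched as follows. This is an illustrative model only — `FlushError` and this `flush` signature are assumptions for the sketch, not the real saluki `request_builder` types — showing the invariant that an oversized compressed payload becomes a recoverable error rather than a panic:

```rust
// Hypothetical error type standing in for the real PayloadTooLarge variant.
#[derive(Debug, PartialEq)]
enum FlushError {
    PayloadTooLarge { compressed_len: usize, limit: usize },
}

// Simplified flush: taking the buffer by value means a rejected payload is
// dropped (mirroring "clears the input"), and the caller logs the error.
fn flush(compressed: Vec<u8>, compressed_limit: usize) -> Result<Vec<u8>, FlushError> {
    if compressed.len() > compressed_limit {
        return Err(FlushError::PayloadTooLarge {
            compressed_len: compressed.len(),
            limit: compressed_limit,
        });
    }
    Ok(compressed)
}
```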