
fix(ddsketch): correct trim_left bin overflow and handle oversized metrics in encoder #1305

Closed
thieman wants to merge 5 commits into main from thieman/fix-ddsketch-trim-left-and-encoder-panic

Conversation

@thieman (Contributor) commented Apr 2, 2026

Result of an investigation into #1119. Closes #1119.

To answer the original question from that issue: the correctness tests do not exercise this distribution case because it is only triggered when the provided sample rate is less than the supported minimum of 0.000000003845 (source). This never happens in the correctness test because Lading defaults the sampling range to a uniform distribution in the range [0.1, 1.0]. When I created a new correctness test which forced the sampling rate to below the minimum, I did observe the expected deviation in DDA and Saluki results. However, I also observed a panic and a logic error, both of which are fixed by this PR.

This PR fixes and provides regression tests for these problems, described below. It does not add any new correctness tests, as we confirmed that sketch behavior at very low sample rates is divergent. After speaking with @tobz we agreed to keep the Saluki behavior, which enforces a limit on maximum bin count, in place as it aligns with our larger goals around bounding.

Change Summary

Two related bugs triggered by distributions with very large per-sample weights (weight > u16::MAX = 65535, i.e. sample rates below ~1.5e-5):
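
The weight/sample-rate relationship stated above can be checked numerically (a quick sketch, not code from this PR):

```rust
// A sample with rate r carries weight 1/r, so per-sample weight exceeds
// u16::MAX (65535) once the sample rate drops below roughly 1.5e-5.
fn weight_for(sample_rate: f64) -> u64 {
    (1.0 / sample_rate).round() as u64
}

fn main() {
    // Below ~1.5e-5, a single sample carries more mass than one u16 bin holds.
    assert!(weight_for(1.5e-5) > u64::from(u16::MAX));
    // At more typical rates, the weight still fits in a single u16 bin.
    assert!(weight_for(1.0e-4) <= u64::from(u16::MAX));
}
```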

Bug 1 — trim_left bin count explosion (lib/ddsketch)

When a single insert_n call generates multiple bins for the same key (because the weight exceeds MAX_BIN_WIDTH = 65535), the collapse loop in trim_left accumulated overflow bins into the output size budget via overflow.truncate(bin_limit + overflow_len). This prevented any actual reduction: the bin array grew without bound across insertions rather than staying at DDSKETCH_CONF_BIN_LIMIT (4096). The fix truncates to bin_limit unconditionally.
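
The fix can be sketched as follows. `cap_bins` is a hypothetical stand-in for the tail of `trim_left`; the real code in lib/ddsketch operates on `Bin` structs rather than raw counts:

```rust
// Hypothetical stand-in for the end of trim_left, per the description above.
// `overflow` holds bins generated while collapsing; `bins_end` holds the bins
// kept past the collapse point.
fn cap_bins(mut overflow: Vec<u32>, bins_end: Vec<u32>, bin_limit: usize) -> Vec<u32> {
    overflow.extend(bins_end);
    // Buggy version grew the budget by the overflow itself, so the combined
    // array was never actually reduced:
    //     overflow.truncate(bin_limit + overflow_len);
    // Fixed version caps to bin_limit unconditionally:
    overflow.truncate(bin_limit);
    overflow
}

fn main() {
    // An oversized combined array is cut down to the configured limit...
    assert_eq!(cap_bins(vec![1; 5000], vec![2; 1000], 4096).len(), 4096);
    // ...while arrays already within the limit pass through untouched.
    assert_eq!(cap_bins(vec![1; 10], vec![2; 10], 4096).len(), 20);
}
```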

Bug 2 — encoder panic on oversized sketch (lib/saluki-components)

The unbounded sketch from bug 1 serializes to an uncompressed encoded size larger than the encoder's 3.2 MiB compressed payload limit. encode_inner detected this via would_write_exceed_threshold's early len > threshold guard and returned Ok(false) — the signal to flush and retry. But on an empty builder, flush() returns an empty vec, hitting the panic in run_request_builder: "builder told us to flush, but gave us nothing". The fix adds InputExceedsCompressedSizeLimit to RequestBuilderError and returns it from encode_inner when the single metric's encoded size already exceeds the compressed limit, so the caller drops the metric via the existing error path rather than looping forever.

Test plan

  • trim_left_respects_bin_limit_with_large_weights — inserts 10 values at weight 260,078,024 (ADP's minimum safe sample rate), asserts bin count never exceeds 4096; fails after the 2nd insert without the fix (bins = 7938)
  • input_exceeds_compressed_size_limit — encodes an input larger than the configured compressed limit into an empty builder, asserts InputExceedsCompressedSizeLimit is returned (not a panic); confirms builder remains usable after the error
  • Manually reproduced original panic via correctness test harness with sample rate 1e-9; confirmed no crash with both fixes applied

🤖 Generated with Claude Code

fix(ddsketch): correct trim_left bin overflow and handle oversized metrics in encoder

Two related bugs triggered by distributions with very large per-sample weights
(weight > u16::MAX = 65535, i.e. sample rates below ~1.5e-5):

1. `trim_left` in the agent DDSketch implementation used
   `overflow.truncate(bin_limit + overflow_len)` instead of
   `overflow.truncate(bin_limit)`. When a single `insert_n` call generates
   multiple bins for the same key (because the weight exceeds MAX_BIN_WIDTH),
   the collapse loop accumulates overflow bins that were added to the truncation
   threshold, preventing any actual reduction. The bin count grew without bound
   across insertions instead of being capped at DDSKETCH_CONF_BIN_LIMIT (4096).

2. The unbounded sketch eventually serialized to an uncompressed encoded size
   larger than the encoder's compressed payload limit (3.2 MiB). In
   `encode_inner`, `would_write_exceed_threshold` returned true via its early
   `len > threshold` guard, causing `encode()` to return `Ok(Some(metric))`
   (signal to flush and retry). But on an empty builder `flush()` returns an
   empty vec, hitting the panic in `run_request_builder`:
   "builder told us to flush, but gave us nothing".

Fixes:
- `trim_left`: truncate to `bin_limit` unconditionally so the result is always
  within the configured limit, accepting the existing precision trade-off for
  large-weight sketches.
- `encode_inner`: when a single metric's encoded size exceeds the compressed
  payload limit it can never fit regardless of payload state; return
  `Err(InputExceedsCompressedSizeLimit)` so the caller drops the metric rather
  than entering an unresolvable flush loop.

Adds regression tests for both bugs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts dd-octo-sts bot added area/core Core functionality, event model, etc. area/components Sources, transforms, and destinations. labels Apr 2, 2026
…omment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman thieman marked this pull request as ready for review April 2, 2026 19:46
@thieman thieman requested a review from a team as a code owner April 2, 2026 19:46
Comment on lines +345 to +353

```rust
// If the input's encoded size already exceeds the compressed payload limit, it can never fit regardless of
// what else is in the payload, so there is no point in asking the caller to flush and retry. Return an error
// so the input is dropped rather than looping forever.
if self.scratch_buf.len() > self.compressed_len_limit {
    return Err(RequestBuilderError::InputExceedsCompressedSizeLimit {
        encoded_len: self.scratch_buf.len(),
        compressed_len_limit: self.compressed_len_limit,
    });
}
```
Member:

This actually... isn't right, now that I think about it.

The whole thing is that we really can't know if we've exceeded the compressed size limit or not until we actually attempt compressing the request payload. I think what's going on, more likely than not, is that we're triggering the "can't encode this metric" pathway (Ok(false)) because CompressionEstimator::would_write_exceed_threshold returns true: naively, it will end up checking if encoded_len is greater than self.compressed_len_limit, and return true if so. That's why your "fix" appears to work, because it's preventing us from reaching that faulty logic.

I think we would actually just want to remove the conditional I mentioned from CompressionEstimator::would_write_exceed_threshold since, realistically, it shouldn't be able to estimate overflow without any compressed data having been written out yet. I don't know why I added that check initially, but it's feeling wrong in hindsight.
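
The suggested change can be sketched like this. Field and method names are assumed from the discussion, and the actual estimation logic is reduced to a placeholder ratio projection:

```rust
// Sketch of CompressionEstimator with the early `len > threshold` exit
// removed, per the suggestion above. Not the real Saluki implementation.
struct CompressionEstimator {
    total_compressed_len: usize,
    total_uncompressed_len: usize,
}

impl CompressionEstimator {
    fn would_write_exceed_threshold(&self, len: usize, threshold: usize) -> bool {
        // Removed: `if len > threshold { return true; }` — the uncompressed
        // length says nothing about compressed size before any bytes exist.
        if self.total_compressed_len == 0 {
            return false; // nothing compressed yet, so nothing to estimate
        }
        // Placeholder: project the observed compression ratio onto this write.
        let projected = len * self.total_compressed_len / self.total_uncompressed_len;
        self.total_compressed_len + projected > threshold
    }
}

fn main() {
    // A huge first write no longer trips the estimator on an empty builder.
    let fresh = CompressionEstimator { total_compressed_len: 0, total_uncompressed_len: 0 };
    assert!(!fresh.would_write_exceed_threshold(10_000_000, 3_200_000));
    // Once data has been written, the ratio-based estimate kicks in.
    let warm = CompressionEstimator { total_compressed_len: 1_000, total_uncompressed_len: 2_000 };
    assert!(warm.would_write_exceed_threshold(4_000, 2_500));
}
```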

Member:

As a further thought, any sort of pathological metric like this would cause problems because even if we allow it in, and then detect an oversized payload during flush and so we split and try to do the two separate halves... the half with the pathological metric will still fail to produce a compressed payload that is within the limit.

thieman (author):

I'll take another look. I'm wondering whether we actually need to bail out ahead of time, or whether we could instead just not panic when we have a buffer of size 1 and still can't flush.

thieman (author):

I'm also curious whether we expect any legit payloads to ever actually be this big (>3MB)?

Member:

Fixing the compression estimator is definitely something we should do: its purpose is to act purely as an optimization hint to flush before we grow payloads so large that they need to be split... but in this case, it's causing a logical error (instructing the caller to flush when there's no actual payload) so that's just straight-up wrong.

If we do that, then what will happen is that we'll keep letting encoded metrics through until the compressor finally does an intermediate flush (because it filled its own internal buffers), and then we'll quickly realize whether the current compressed size is over the limit... and if not, then the actual estimator logic will kick in. Finally, the check during flush protects us from oversized payloads.

Member:

> I'm also curious whether we expect any legit payloads to ever actually be this big (>3MB)?

Well, in the case of our weird Go tracer behavior reporting allocation statistics with crazy high sample rates... yes? That's uncompressed, though. All of those nearly identical bins should then compress down fairly well.

With the v3 intake protocol (#1175), they'll compress down really well because of the delta encoding that we do.

thieman (author):

Those don't get so large once the trimming is actually working though, right? Shouldn't they be in the 10s of KBs?

```rust
// Truncate to bin_limit so the total bin count stays within the configured limit. Overflow bins created
// above (when collapsed counts exceed MAX_BIN_WIDTH) are prepended before bins_end, and together the
// combined slice is capped at bin_limit. This may discard some higher-key bins from bins_end when
// overflow is large, which is the expected precision trade-off for a bounded sketch.
```
Member:

Technically, I think we actually want to trim the lower bins, since lowest-collapsing is the default behavior for the official DDSketch implementation, IIRC: it means you end up trading accuracy at lower percentiles where it matters far less.

As is, we would be harming the upper percentiles: p95, p99, yadda yadda.

thieman (author):

Ah yeah, truncate keeps the earlier ones, so I guess this wasn't trimming left to begin with. Will fix.

thieman and others added 3 commits April 3, 2026 10:24
The accumulation loop in trim_left was pushing overflow bins using
bin.k (the removed bin's key) when running mass exceeded MAX_BIN_WIDTH.
Per CollapsingLowestDenseStore in sketches-go, all removed mass must
collapse into the first kept bin's index. This meant the old code could
leave bins with removed keys in the output.

Fix: accumulate missing in the loop without emitting intermediate bins,
then fold everything into bins[num_to_remove] and pass first_kept_k to
generate_bins for any u16 overflow.

Adds four unit tests that verify exact bin counts and keys after
collapsing, ported from the sketches-go collapsingLowest semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
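
The collapsing-lowest semantics described in this commit can be sketched as follows. Bins are simplified to `(key, count)` pairs; the real code additionally routes any remainder above `MAX_BIN_WIDTH` through `generate_bins` with `first_kept_k`:

```rust
// Sketch of collapsingLowest folding, per the commit message above: all
// removed mass accumulates (no intermediate bins emitted) and folds into the
// first kept bin, saturating at MAX_BIN_WIDTH. Overflow handling via
// generate_bins is omitted here.
const MAX_BIN_WIDTH: u64 = u16::MAX as u64; // pre-u32-widening value

fn collapse_lowest(bins: &mut Vec<(i16, u64)>, num_to_remove: usize) {
    // Accumulate the removed mass without emitting intermediate bins.
    let missing: u64 = bins.drain(..num_to_remove).map(|(_, n)| n).sum();
    if let Some(first) = bins.first_mut() {
        // Fold everything into the first kept bin's index, saturating.
        first.1 = (first.1 + missing).min(MAX_BIN_WIDTH);
    }
}

fn main() {
    // Two 50000-count low bins collapse into the first kept bin, which
    // saturates at 65535 rather than keeping the removed bins' keys.
    let mut bins = vec![(1, 50_000), (2, 50_000), (3, 1)];
    collapse_lowest(&mut bins, 2);
    assert_eq!(bins, vec![(3, 65_535)]);
}
```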
The previous test only checked bin keys, which passed with the old buggy
code because wrong-key overflow bins were always drained from the front.
The actual observable difference is count: the old loop subtracted
MAX_BIN_WIDTH from `missing` into a wrong-key bin, leaving the first kept
bin with a lower n than it should have.

New test asserts the first kept bin is saturated at MAX_BIN_WIDTH (65535)
after collapsing two 50000-count bins, which fails with the old code (got
34466) and passes with the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…collapse

With u16 (max 65535), a single insert_n call with weight ~260M generated
~3969 bins per key. Collapsing those bins required multiple generate_bins
iterations, and when the result exceeded bin_limit the drain discarded
real observations — up to 34466 lost per trim_left invocation in practice.

Widening n to u32 (max ~4.3B) means any per-key count up to that ceiling
fits in a single bin, so collapse never loses mass for realistic workloads.
The ADP weight case (260_078_024) now produces exactly one bin per key.

Changes:
- Bin.n: u16 → u32; MAX_BIN_WIDTH: u16::MAX → u32::MAX in both bin.rs
  and sketch.rs
- generate_bins: cast targets updated to u32
- merge_to_dogsketch: u32::from(bin.n) simplified to bin.n (already u32)
- TryFrom<Dogsketch>: removed now-redundant u16::try_from for bin count
- insert_raw_bin: n parameter u16 → u32
- Allocation tests: updated expected byte counts to reflect Bin size
  growing from 4 to 8 bytes (i16 + 2 padding + u32)
- Unit tests: trim_left_overflow_maximizes_mass_in_first_kept_bin replaced
  with trim_left_preserves_exact_count_with_u32_bins, which now verifies
  that the 100001-count collapse is fully preserved (no saturation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
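
The size arithmetic in the allocation-test note above can be checked directly. `Bin` here is a hypothetical mirror of the real struct, assumed to hold an `i16` key and a `u32` count:

```rust
// Hypothetical mirror of the widened Bin described above: an i16 key plus a
// u32 count aligns to 4 bytes, growing the struct from 4 to 8 bytes
// (i16 + 2 bytes padding + u32).
#[allow(dead_code)]
struct Bin {
    k: i16, // bin key
    n: u32, // count, widened from u16
}

fn main() {
    assert_eq!(std::mem::size_of::<Bin>(), 8);
    // The ADP weight case now fits in a single bin per key...
    assert!(260_078_024u64 <= u64::from(u32::MAX));
    // ...where it previously needed 3968 full u16 bins plus a partial one
    // (the ~3969 bins per key mentioned above).
    assert_eq!(260_078_024u64 / u64::from(u16::MAX), 3968);
}
```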
thieman added a commit that referenced this pull request Apr 3, 2026
…ed on first write

CompressionEstimator::would_write_exceed_threshold had an early exit:
  if len > threshold { return true; }

When a pathological metric (e.g. a sketch with many bins from a very
low sample rate) was the very first input encoded, its uncompressed
encoded length could exceed the configured compressed size limit. The
early exit fired before any compressed data had been written, signalling
encode_inner to flush — but with an empty builder, flush() returned
nothing, hitting the panic:
  "builder told us to flush, but gave us nothing"

The early exit is wrong: uncompressed size > compressed limit says
nothing about whether the compressed output will exceed the limit (many
sketches with near-identical bins compress very well). The estimator's
job is to act as an optimization hint, not a hard gate that can fire
before the first byte has been compressed.

Removing the check means the existing total_compressed_len == 0 guard
is the first thing evaluated, returning false and allowing the write to
proceed. A genuinely oversized payload will be caught during flush via
the existing PayloadTooLarge path instead of panicking.

Fixes the panic reported in #1119 / discussed in #1305.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

thieman commented Apr 3, 2026

Reworking this in #1307 and #1310

@thieman thieman closed this Apr 3, 2026
gh-worker-dd-mergequeue-cf854d bot pushed a commit that referenced this pull request Apr 3, 2026
…ck that panics on first write (#1310)

## Problem

When a pathological metric (e.g. a sketch with many bins generated by an extremely low sample rate) was the very first input encoded into a request builder, the agent would panic:

```
builder told us to flush, but gave us nothing
```

**Root cause**: `CompressionEstimator::would_write_exceed_threshold` had an early exit:

```rust
if len > threshold {
    return true;
}
```

If the metric's uncompressed encoded size exceeded the configured *compressed* size limit, this guard fired and returned `true` — signalling that a flush was needed — even though nothing had been written to the compressor yet. `encode_inner` returned `Ok(false)` (flush signal), `flush()` returned an empty vec, and the caller panicked.

The check is logically wrong: uncompressed size being larger than the compressed limit says nothing about whether the *compressed* output will fit. Sketches with thousands of near-identical bins (the exact pathological case) compress extremely well. The estimator's purpose is to be an optimization hint, not a hard gate.

## Fix

Remove the early `len > threshold` exit from `would_write_exceed_threshold`. The existing `total_compressed_len == 0` guard (which already returns `false` when no data has been written) is sufficient — and is now the first thing evaluated.

If the payload genuinely is too large after compression, the backstop is [`flush()` at the `compressed_len > compressed_limit` check](https://github.com/DataDog/saluki/blob/cf17a0cd17/lib/saluki-components/src/common/datadog/request_builder.rs#L399): for a single oversized metric it clears the input and returns `Err(PayloadTooLarge)`, which the encoder caller logs and drops rather than panicking.

This matches the approach @tobz suggested in #1305.

## Test plan

- [x] `first_input_larger_than_compressed_limit_does_not_panic` — creates a builder with a compressed limit (1 KB) smaller than the input (10 KB of repeated chars), encodes the large input as the very first metric, and asserts no panic
- [x] Updated `compression_estimator_no_output` unit test — now asserts `false` for both cases when no compressed data has been written, reflecting the new semantics
- [x] All existing request builder and compression estimator tests pass

Closes #1119. See also #1305.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: travis.thieman <travis.thieman@datadoghq.com>
dd-octo-sts bot pushed a commit (c8eb887) that referenced this pull request Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

Investigate why correctness tests don't seem to exercise sample rate handling difference between Agent and ADP for DogStatsD.
