
fix(ddsketch): correct trim_left bin overflow and handle oversized metrics in encoder #1305

Closed
thieman wants to merge 5 commits into main from thieman/fix-ddsketch-trim-left-and-encoder-panic

Conversation

@thieman (Contributor) commented Apr 2, 2026

Result of an investigation into #1119. Closes #1119.

To answer the original question from that issue: the correctness tests do not exercise this distribution case because it is only triggered when the provided sample rate is less than the supported minimum of 0.000000003845 (source). This never happens in the correctness test because Lading defaults the sampling range to a uniform distribution in the range [0.1, 1.0]. When I created a new correctness test which forced the sampling rate to below the minimum, I did observe the expected deviation in DDA and Saluki results. However, I also observed a panic and a logic error, both of which are fixed by this PR.

This PR fixes and provides regression tests for these problems, described below. It does not add any new correctness tests, as we confirmed that sketch behavior at very low sample rates is divergent. After speaking with @tobz we agreed to keep the Saluki behavior, which enforces a limit on maximum bin count, in place as it aligns with our larger goals around bounding.

Change Summary

Two related bugs triggered by distributions with very large per-sample weights (weight > u16::MAX = 65535, i.e. sample rates below ~1.5e-5):
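
The weight/sample-rate relationship stated above can be checked numerically (a quick sketch, not code from this PR):

```rust
// A sample with rate r carries weight 1/r, so per-sample weight exceeds
// u16::MAX (65535) once the sample rate drops below roughly 1.5e-5.
fn weight_for(sample_rate: f64) -> u64 {
    (1.0 / sample_rate).round() as u64
}

fn main() {
    // Below ~1.5e-5, a single sample carries more mass than one u16 bin holds.
    assert!(weight_for(1.5e-5) > u64::from(u16::MAX));
    // At more typical rates, the weight still fits in a single u16 bin.
    assert!(weight_for(1.0e-4) <= u64::from(u16::MAX));
}
```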

Bug 1 — trim_left bin count explosion (lib/ddsketch)

When a single insert_n call generates multiple bins for the same key (because the weight exceeds MAX_BIN_WIDTH = 65535), the collapse loop in trim_left accumulated overflow bins into the output size budget via overflow.truncate(bin_limit + overflow_len). This prevented any actual reduction: the bin array grew without bound across insertions rather than staying at DDSKETCH_CONF_BIN_LIMIT (4096). The fix truncates to bin_limit unconditionally.
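
The fix can be sketched as follows. `cap_bins` is a hypothetical stand-in for the tail of `trim_left`; the real code in lib/ddsketch operates on `Bin` structs rather than raw counts:

```rust
// Hypothetical stand-in for the end of trim_left, per the description above.
// `overflow` holds bins generated while collapsing; `bins_end` holds the bins
// kept past the collapse point.
fn cap_bins(mut overflow: Vec<u32>, bins_end: Vec<u32>, bin_limit: usize) -> Vec<u32> {
    overflow.extend(bins_end);
    // Buggy version grew the budget by the overflow itself, so the combined
    // array was never actually reduced:
    //     overflow.truncate(bin_limit + overflow_len);
    // Fixed version caps to bin_limit unconditionally:
    overflow.truncate(bin_limit);
    overflow
}

fn main() {
    // An oversized combined array is cut down to the configured limit...
    assert_eq!(cap_bins(vec![1; 5000], vec![2; 1000], 4096).len(), 4096);
    // ...while arrays already within the limit pass through untouched.
    assert_eq!(cap_bins(vec![1; 10], vec![2; 10], 4096).len(), 20);
}
```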

Bug 2 — encoder panic on oversized sketch (lib/saluki-components)

The unbounded sketch from bug 1 serializes to an uncompressed encoded size larger than the encoder's 3.2 MiB compressed payload limit. encode_inner detected this via would_write_exceed_threshold's early len > threshold guard and returned Ok(false) — the signal to flush and retry. But on an empty builder, flush() returns an empty vec, hitting the panic in run_request_builder: "builder told us to flush, but gave us nothing". The fix adds InputExceedsCompressedSizeLimit to RequestBuilderError and returns it from encode_inner when the single metric's encoded size already exceeds the compressed limit, so the caller drops the metric via the existing error path rather than looping forever.

Test plan

  • trim_left_respects_bin_limit_with_large_weights — inserts 10 values at weight 260,078,024 (ADP's minimum safe sample rate), asserts bin count never exceeds 4096; fails after the 2nd insert without the fix (bins = 7938)
  • input_exceeds_compressed_size_limit — encodes an input larger than the configured compressed limit into an empty builder, asserts InputExceedsCompressedSizeLimit is returned (not a panic); confirms builder remains usable after the error
  • Manually reproduced original panic via correctness test harness with sample rate 1e-9; confirmed no crash with both fixes applied

🤖 Generated with Claude Code

fix(ddsketch): correct trim_left bin overflow and handle oversized metrics in encoder

Two related bugs triggered by distributions with very large per-sample weights
(weight > u16::MAX = 65535, i.e. sample rates below ~1.5e-5):

1. `trim_left` in the agent DDSketch implementation used
   `overflow.truncate(bin_limit + overflow_len)` instead of
   `overflow.truncate(bin_limit)`. When a single `insert_n` call generates
   multiple bins for the same key (because the weight exceeds MAX_BIN_WIDTH),
   the collapse loop accumulates overflow bins that were added to the truncation
   threshold, preventing any actual reduction. The bin count grew without bound
   across insertions instead of being capped at DDSKETCH_CONF_BIN_LIMIT (4096).

2. The unbounded sketch eventually serialized to an uncompressed encoded size
   larger than the encoder's compressed payload limit (3.2 MiB). In
   `encode_inner`, `would_write_exceed_threshold` returned true via its early
   `len > threshold` guard, causing `encode()` to return `Ok(Some(metric))`
   (signal to flush and retry). But on an empty builder `flush()` returns an
   empty vec, hitting the panic in `run_request_builder`:
   "builder told us to flush, but gave us nothing".

Fixes:
- `trim_left`: truncate to `bin_limit` unconditionally so the result is always
  within the configured limit, accepting the existing precision trade-off for
  large-weight sketches.
- `encode_inner`: when a single metric's encoded size exceeds the compressed
  payload limit it can never fit regardless of payload state; return
  `Err(InputExceedsCompressedSizeLimit)` so the caller drops the metric rather
  than entering an unresolvable flush loop.

Adds regression tests for both bugs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dd-octo-sts dd-octo-sts bot added area/core Core functionality, event model, etc. area/components Sources, transforms, and destinations. labels Apr 2, 2026
…omment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thieman thieman marked this pull request as ready for review April 2, 2026 19:46
@thieman thieman requested a review from a team as a code owner April 2, 2026 19:46
Comment on lines +345 to +353

```rust
// If the input's encoded size already exceeds the compressed payload limit, it can never fit regardless of
// what else is in the payload, so there is no point in asking the caller to flush and retry. Return an error
// so the input is dropped rather than looping forever.
if self.scratch_buf.len() > self.compressed_len_limit {
    return Err(RequestBuilderError::InputExceedsCompressedSizeLimit {
        encoded_len: self.scratch_buf.len(),
        compressed_len_limit: self.compressed_len_limit,
    });
}
```
Member:

This actually... isn't right, now that I think about it.

The whole thing is that we really can't know if we've exceeded the compressed size limit or not until we actually attempt compressing the request payload. I think what's going on, more likely than not, is that we're triggering the "can't encode this metric" pathway (Ok(false)) because CompressionEstimator::would_write_exceed_threshold returns true: naively, it will end up checking if encoded_len is greater than self.compressed_len_limit, and return true if so. That's why your "fix" appears to work, because it's preventing us from reaching that faulty logic.

I think we would actually just want to remove the conditional I mentioned from CompressionEstimator::would_write_exceed_threshold since, realistically, it shouldn't be able to estimate overflow without any compressed data having been written out yet. I don't know why I added that check initially, but it's feeling wrong in hindsight.
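
The suggested change can be sketched like this. Field and method names are assumed from the discussion, and the actual estimation logic is reduced to a placeholder ratio projection:

```rust
// Sketch of CompressionEstimator with the early `len > threshold` exit
// removed, per the suggestion above. Not the real Saluki implementation.
struct CompressionEstimator {
    total_compressed_len: usize,
    total_uncompressed_len: usize,
}

impl CompressionEstimator {
    fn would_write_exceed_threshold(&self, len: usize, threshold: usize) -> bool {
        // Removed: `if len > threshold { return true; }` — the uncompressed
        // length says nothing about compressed size before any bytes exist.
        if self.total_compressed_len == 0 {
            return false; // nothing compressed yet, so nothing to estimate
        }
        // Placeholder: project the observed compression ratio onto this write.
        let projected = len * self.total_compressed_len / self.total_uncompressed_len;
        self.total_compressed_len + projected > threshold
    }
}

fn main() {
    // A huge first write no longer trips the estimator on an empty builder.
    let fresh = CompressionEstimator { total_compressed_len: 0, total_uncompressed_len: 0 };
    assert!(!fresh.would_write_exceed_threshold(10_000_000, 3_200_000));
    // Once data has been written, the ratio-based estimate kicks in.
    let warm = CompressionEstimator { total_compressed_len: 1_000, total_uncompressed_len: 2_000 };
    assert!(warm.would_write_exceed_threshold(4_000, 2_500));
}
```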

Member:

As a further thought, any sort of pathological metric like this would cause problems because even if we allow it in, and then detect an oversized payload during flush and so we split and try to do the two separate halves... the half with the pathological metric will still fail to produce a compressed payload that is within the limit.

thieman (author):

I'll take another look. I'm wondering whether we actually need to bail out ahead of time, or whether we could instead just not panic when we have a buffer of size 1 and still can't flush.

thieman (author):

I'm also curious whether we expect any legit payloads to ever actually be this big (>3MB)?

Member:

Fixing the compression estimator is definitely something we should do: its purpose is to act purely as an optimization hint to flush before we grow payloads so large that they need to be split... but in this case, it's causing a logical error (instructing the caller to flush when there's no actual payload) so that's just straight-up wrong.

If we do that, then what will happen is that we'll keep letting encoded metrics through until the compressor finally does an intermediate flush (because it filled its own internal buffers), and then we'll quickly realize whether the current compressed size is over the limit... and if not, then the actual estimator logic will kick in. Finally, the check during flush protects us from oversized payloads.

Member:

> I'm also curious whether we expect any legit payloads to ever actually be this big (>3MB)?

Well, in the case of our weird Go tracer behavior reporting allocation statistics with crazy high sample rates... yes? That's uncompressed, though. All of those nearly identical bins should then compress down fairly well.

With the v3 intake protocol (#1175), they'll compress down really well because of the delta encoding that we do.

thieman (author):

Those don't get so large once the trimming is actually working though, right? Shouldn't they be in the 10s of KBs?

```rust
// Truncate to bin_limit so the total bin count stays within the configured limit. Overflow bins created
// above (when collapsed counts exceed MAX_BIN_WIDTH) are prepended before bins_end, and together the
// combined slice is capped at bin_limit. This may discard some higher-key bins from bins_end when
// overflow is large, which is the expected precision trade-off for a bounded sketch.
```
Member:

Technically, I think we actually want to trim the lower bins, since lowest-collapsing is the default behavior for the official DDSketch implementation, IIRC: it means you end up trading accuracy at lower percentiles where it matters far less.

As is, we would be harming the upper percentiles: p95, p99, yadda yadda.

thieman (author):

Ah yeah, truncate keeps the earlier ones, so I guess this wasn't trimming left to begin with. Will fix.

thieman and others added 3 commits April 3, 2026 10:24
The accumulation loop in trim_left was pushing overflow bins using
bin.k (the removed bin's key) when running mass exceeded MAX_BIN_WIDTH.
Per CollapsingLowestDenseStore in sketches-go, all removed mass must
collapse into the first kept bin's index. This meant the old code could
leave bins with removed keys in the output.

Fix: accumulate missing in the loop without emitting intermediate bins,
then fold everything into bins[num_to_remove] and pass first_kept_k to
generate_bins for any u16 overflow.

Adds four unit tests that verify exact bin counts and keys after
collapsing, ported from the sketches-go collapsingLowest semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
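
The collapsing-lowest semantics described in this commit can be sketched as follows. Bins are simplified to `(key, count)` pairs; the real code additionally routes any remainder above `MAX_BIN_WIDTH` through `generate_bins` with `first_kept_k`:

```rust
// Sketch of collapsingLowest folding, per the commit message above: all
// removed mass accumulates (no intermediate bins emitted) and folds into the
// first kept bin, saturating at MAX_BIN_WIDTH. Overflow handling via
// generate_bins is omitted here.
const MAX_BIN_WIDTH: u64 = u16::MAX as u64; // pre-u32-widening value

fn collapse_lowest(bins: &mut Vec<(i16, u64)>, num_to_remove: usize) {
    // Accumulate the removed mass without emitting intermediate bins.
    let missing: u64 = bins.drain(..num_to_remove).map(|(_, n)| n).sum();
    if let Some(first) = bins.first_mut() {
        // Fold everything into the first kept bin's index, saturating.
        first.1 = (first.1 + missing).min(MAX_BIN_WIDTH);
    }
}

fn main() {
    // Two 50000-count low bins collapse into the first kept bin, which
    // saturates at 65535 rather than keeping the removed bins' keys.
    let mut bins = vec![(1, 50_000), (2, 50_000), (3, 1)];
    collapse_lowest(&mut bins, 2);
    assert_eq!(bins, vec![(3, 65_535)]);
}
```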
The previous test only checked bin keys, which passed with the old buggy
code because wrong-key overflow bins were always drained from the front.
The actual observable difference is count: the old loop subtracted
MAX_BIN_WIDTH from `missing` into a wrong-key bin, leaving the first kept
bin with a lower n than it should have.

New test asserts the first kept bin is saturated at MAX_BIN_WIDTH (65535)
after collapsing two 50000-count bins, which fails with the old code (got
34466) and passes with the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…collapse

With u16 (max 65535), a single insert_n call with weight ~260M generated
~3969 bins per key. Collapsing those bins required multiple generate_bins
iterations, and when the result exceeded bin_limit the drain discarded
real observations — up to 34466 lost per trim_left invocation in practice.

Widening n to u32 (max ~4.3B) means any per-key count up to that ceiling
fits in a single bin, so collapse never loses mass for realistic workloads.
The ADP weight case (260_078_024) now produces exactly one bin per key.

Changes:
- Bin.n: u16 → u32; MAX_BIN_WIDTH: u16::MAX → u32::MAX in both bin.rs
  and sketch.rs
- generate_bins: cast targets updated to u32
- merge_to_dogsketch: u32::from(bin.n) simplified to bin.n (already u32)
- TryFrom<Dogsketch>: removed now-redundant u16::try_from for bin count
- insert_raw_bin: n parameter u16 → u32
- Allocation tests: updated expected byte counts to reflect Bin size
  growing from 4 to 8 bytes (i16 + 2 padding + u32)
- Unit tests: trim_left_overflow_maximizes_mass_in_first_kept_bin replaced
  with trim_left_preserves_exact_count_with_u32_bins, which now verifies
  that the 100001-count collapse is fully preserved (no saturation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
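
The size arithmetic in the allocation-test note above can be checked directly. `Bin` here is a hypothetical mirror of the real struct, assumed to hold an `i16` key and a `u32` count:

```rust
// Hypothetical mirror of the widened Bin described above: an i16 key plus a
// u32 count aligns to 4 bytes, growing the struct from 4 to 8 bytes
// (i16 + 2 bytes padding + u32).
#[allow(dead_code)]
struct Bin {
    k: i16, // bin key
    n: u32, // count, widened from u16
}

fn main() {
    assert_eq!(std::mem::size_of::<Bin>(), 8);
    // The ADP weight case now fits in a single bin per key...
    assert!(260_078_024u64 <= u64::from(u32::MAX));
    // ...where it previously needed 3968 full u16 bins plus a partial one
    // (the ~3969 bins per key mentioned above).
    assert_eq!(260_078_024u64 / u64::from(u16::MAX), 3968);
}
```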
thieman added a commit that referenced this pull request Apr 3, 2026
…ed on first write

CompressionEstimator::would_write_exceed_threshold had an early exit:
  if len > threshold { return true; }

When a pathological metric (e.g. a sketch with many bins from a very
low sample rate) was the very first input encoded, its uncompressed
encoded length could exceed the configured compressed size limit. The
early exit fired before any compressed data had been written, signalling
encode_inner to flush — but with an empty builder, flush() returned
nothing, hitting the panic:
  "builder told us to flush, but gave us nothing"

The early exit is wrong: uncompressed size > compressed limit says
nothing about whether the compressed output will exceed the limit (many
sketches with near-identical bins compress very well). The estimator's
job is to act as an optimization hint, not a hard gate that can fire
before the first byte has been compressed.

Removing the check means the existing total_compressed_len == 0 guard
is the first thing evaluated, returning false and allowing the write to
proceed. A genuinely oversized payload will be caught during flush via
the existing PayloadTooLarge path instead of panicking.

Fixes the panic reported in #1119 / discussed in #1305.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

thieman commented Apr 3, 2026

Reworking this in #1307 and #1310

@thieman thieman closed this Apr 3, 2026
gh-worker-dd-mergequeue-cf854d bot pushed a commit that referenced this pull request Apr 3, 2026
…ck that panics on first write (#1310)

## Problem

When a pathological metric (e.g. a sketch with many bins generated by an extremely low sample rate) was the very first input encoded into a request builder, the agent would panic:

```
builder told us to flush, but gave us nothing
```

**Root cause**: `CompressionEstimator::would_write_exceed_threshold` had an early exit:

```rust
if len > threshold {
    return true;
}
```

If the metric's uncompressed encoded size exceeded the configured *compressed* size limit, this guard fired and returned `true` — signalling that a flush was needed — even though nothing had been written to the compressor yet. `encode_inner` returned `Ok(false)` (flush signal), `flush()` returned an empty vec, and the caller panicked.

The check is logically wrong: uncompressed size being larger than the compressed limit says nothing about whether the *compressed* output will fit. Sketches with thousands of near-identical bins (the exact pathological case) compress extremely well. The estimator's purpose is to be an optimization hint, not a hard gate.

## Fix

Remove the early `len > threshold` exit from `would_write_exceed_threshold`. The existing `total_compressed_len == 0` guard (which already returns `false` when no data has been written) is sufficient — and is now the first thing evaluated.

If the payload genuinely is too large after compression, the backstop is [`flush()` at the `compressed_len > compressed_limit` check](https://github.com/DataDog/saluki/blob/cf17a0cd17/lib/saluki-components/src/common/datadog/request_builder.rs#L399): for a single oversized metric it clears the input and returns `Err(PayloadTooLarge)`, which the encoder caller logs and drops rather than panicking.

This matches the approach @tobz suggested in #1305.

## Test plan

- [x] `first_input_larger_than_compressed_limit_does_not_panic` — creates a builder with a compressed limit (1 KB) smaller than the input (10 KB of repeated chars), encodes the large input as the very first metric, and asserts no panic
- [x] Updated `compression_estimator_no_output` unit test — now asserts `false` for both cases when no compressed data has been written, reflecting the new semantics
- [x] All existing request builder and compression estimator tests pass

Closes #1119. See also #1305.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: travis.thieman <travis.thieman@datadoghq.com>
dd-octo-sts bot pushed a commit (c8eb887) that referenced this pull request Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

Investigate why correctness tests don't seem to exercise sample rate handling difference between Agent and ADP for DogStatsD.
