Skip to content

fix(slo): Correctly report TOO_MANY_BYTES from allocation policy as rate-limited#7846

Merged
volokluev merged 1 commit intomasterfrom
volo/fix_slo_reporting
Mar 27, 2026
Merged

fix(slo): Correctly report TOO_MANY_BYTES from allocation policy as rate-limited#7846
volokluev merged 1 commit intomasterfrom
volo/fix_slo_reporting

Conversation

@volokluev
Copy link
Copy Markdown
Member

@volokluev volokluev commented Mar 27, 2026

Human summary

Our SnQL API supposedly has a very high error rate:

https://app.datadoghq.com/s/FH6-Y3/7s2-qbw-ap9

image

but these errors are not being captured in sentry. I dug into the querylog and noticed that almost all of them scanned too many bytes. This is a decision the BytesScannedRejectingPolicy can make to allow small queries through. It's surfaced as a rate limit to the caller but the metrics call it an error. This PR fixes it

Robot Summary

Fixes SLO reporting for TOO_MANY_BYTES errors that occur when queries exceed the max_bytes_to_read limit set by the BytesScannedRejectingPolicy allocation policy.

Problem

When the BytesScannedRejectingPolicy determines a customer is over their scan limit, it can set max_bytes_to_read on the ClickHouse query settings to allow smaller queries through instead of rejecting them outright. When a query exceeds this limit, ClickHouse raises a TOO_MANY_BYTES error (code 307).

Previously, this was being reported as a generic ERROR in SLO metrics, which counts against the SLO. However, this is actually rate limiting behavior and should be reported as RATE_LIMITED, which does not count against the SLO.

Changes

  1. Track allocation policy context (snuba/web/db_query.py:860)

    • Added max_bytes_to_read_set_by_policy flag to stats when the allocation policy sets max_bytes_to_read
  2. Update error handling (snuba/web/db_query.py:479-483)

    • Modified TOO_MANY_BYTES handling to only convert to RateLimitExceeded when the flag is set
    • Preserves generic error classification for TOO_MANY_BYTES from other sources
  3. Context-aware status determination (snuba/querylog/query_metadata.py:128-149)

    • Updated get_request_status() to accept optional context parameter
    • Checks for max_bytes_to_read_set_by_policy flag when handling TOO_MANY_BYTES errors
    • Returns RequestStatus.RATE_LIMITED only when the flag is present
  4. Comprehensive test coverage

    • Added unit tests for get_request_status() context handling
    • Added integration tests for both allocation policy and non-policy scenarios
    • Verifies correct SLO classification in both cases

Result

  • ✅ TOO_MANY_BYTES from allocation policy → RATE_LIMITED (SLO "for")
  • ✅ TOO_MANY_BYTES from other sources → ERROR (SLO "against")
  • ✅ Allocation policy rate limiting no longer incorrectly impacts SLO metrics

Test Plan

# Run unit tests
pytest tests/querylog/test_query_metadata.py -v

# Run API status tests
pytest tests/test_api_status.py::TestApiCodes::test_too_many_bytes_from_allocation_policy -v
pytest tests/test_api_status.py::TestApiCodes::test_too_many_bytes_without_allocation_policy -v

All tests pass ✅

…ate-limited

When the BytesScannedRejectingPolicy sets max_bytes_to_read to allow
smaller queries through, exceeding this limit triggers a TOO_MANY_BYTES
error from ClickHouse. Previously, this was incorrectly reported as a
generic ERROR in SLO metrics, counting against the SLO.

This change:
- Tracks when max_bytes_to_read is set by allocation policy via a flag
- Updates get_request_status to accept context and check the flag
- Only classifies TOO_MANY_BYTES as RATE_LIMITED when set by policy
- Adds comprehensive test coverage for both scenarios

Result: Allocation policy rate limiting no longer incorrectly impacts SLO.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@volokluev volokluev marked this pull request as ready for review March 27, 2026 17:59
@volokluev volokluev requested a review from a team as a code owner March 27, 2026 17:59
@volokluev volokluev merged commit e25b8b3 into master Mar 27, 2026
46 of 47 checks passed
@volokluev volokluev deleted the volo/fix_slo_reporting branch March 27, 2026 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants