Add rate-limited log warning when BatchSpanProcessor drops spans (#8167)
Conversation
```java
if (!queue.offer(span)) {
  spanProcessorInstrumentation.dropSpans(1);
  throttledLogger.log(
      Level.WARNING,
```
Isn't WARNING too high a level? I think INFO would be enough.
Good point. I think WARNING is appropriate here because dropped spans represent actual data loss, which impacts observability. This warrants immediate attention from operators in production. Since we're using ThrottlingLogger, the output is already rate-limited to prevent log spam.
However, if the team prefers INFO, I'm happy to adjust. Please let me know.
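For context on why the rate limiting matters, here is a minimal sketch of how a throttling logger can work: allow the first few messages per time window, then suppress the rest. This is an illustrative stand-alone class (`RateLimitedLogger` is a hypothetical name), not the SDK's actual `ThrottlingLogger` implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Sketch of a rate-limited logger: emits at most maxPerWindow messages
 * per time window, suppressing the rest until the window rolls over.
 */
public final class RateLimitedLogger {
  private final Logger delegate;
  private final long windowNanos;
  private final int maxPerWindow;
  private final AtomicLong windowStart = new AtomicLong(System.nanoTime());
  private final AtomicInteger loggedInWindow = new AtomicInteger();

  public RateLimitedLogger(Logger delegate, long windowNanos, int maxPerWindow) {
    this.delegate = delegate;
    this.windowNanos = windowNanos;
    this.maxPerWindow = maxPerWindow;
  }

  /** Returns true if the message was actually emitted, false if suppressed. */
  public boolean log(Level level, String message) {
    long now = System.nanoTime();
    long start = windowStart.get();
    if (now - start > windowNanos && windowStart.compareAndSet(start, now)) {
      loggedInWindow.set(0); // a new window begins: reset the counter
    }
    if (loggedInWindow.incrementAndGet() <= maxPerWindow) {
      delegate.log(level, message);
      return true;
    }
    return false; // over budget for this window: drop the message
  }
}
```

The trade-off discussed below follows from this design: a throttled logger tells you that drops are happening, but not how many, because everything past the per-window budget is silently suppressed.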
```java
ComponentId.generateLazy("batching_span_processor");

private static final Logger logger = Logger.getLogger(BatchSpanProcessor.class.getName());
private static final ThrottlingLogger throttledLogger = new ThrottlingLogger(logger);
```
throttlingLogger would be a better name.
Good catch! I've updated the variable name from throttledLogger to throttlingLogger to better reflect that this logger performs throttling (matching the active form of the class name ThrottlingLogger).
jack-berg left a comment
Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?
More generally, do we want to duplicate signals? If so, when?
We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?
cc @jkwatson
I think we've been assuming that everyone will at least have metrics in their pipeline, but perhaps that's a bad assumption. A throttled log for this seems non-terrible to me, but I don't really have a strong opinion one way or another. Perhaps a simpler way would be to just log if we dropped spans at the end of the export (or do we already do that?)
Thanks for the feedback, @jack-berg and @jkwatson! I'd argue the log isn't really a duplication; it serves a different audience. The internal metrics (recorded via `spanProcessorInstrumentation.dropSpans(1)`) are only visible if a metrics pipeline is configured, whereas a log line surfaces the problem to anyone reading application logs. This also follows established patterns within the SDK, such as logging export failures while also recording them to internal telemetry. Regarding the suggestion to log at the end of the export: the drop happens at `queue.offer` time in `addSpan()`, on the application thread, rather than in the worker's export loop. The use of `ThrottlingLogger` keeps the output rate-limited.
I think my (and perhaps Jack's) point was not that the ThrottlingLogger wasn't a good option, but that it might be more important to know how many spans got dropped, which this approach will not let you know, since the logs will only be emitted for the first few. It would let you know that spans were being dropped, but not how bad the problem is. If a solution could be put together to log the total number dropped per batch (which would certainly require more work!), that would probably be a better solution.
That's a great point, @jkwatson. I see what you mean now. I think the per-batch drop count approach you're suggesting is a better solution. The implementation would be fairly straightforward: replace the per-span throttled log with an `AtomicInteger` drop counter that is read and reset once per export cycle.

This is naturally throttled by the export cycle, gives visibility into how severe the backpressure is, and the threading model already supports it cleanly: the counter is incremented in `addSpan()` on application threads and read/reset with `getAndSet(0)` in `exportCurrentBatch()` on the worker thread. I'll update the PR with this approach. Let me know if that aligns with what you had in mind.
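The idea can be sketched as a small stand-alone class. This is an illustrative sketch of the pattern under discussion, not the PR's actual code; the names `DropCountingQueue` and `reportDrops` are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Sketch of per-batch drop counting: count drops on the producer side,
 * then log one aggregate message per export cycle on the worker side.
 */
public final class DropCountingQueue<T> {
  private static final Logger logger = Logger.getLogger(DropCountingQueue.class.getName());

  private final BlockingQueue<T> queue;
  private final AtomicInteger droppedCount = new AtomicInteger();

  public DropCountingQueue(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Called from application threads; counts drops instead of logging per item. */
  public void add(T item) {
    if (!queue.offer(item)) {
      droppedCount.incrementAndGet();
    }
  }

  /** Called once per export cycle from the worker thread. Returns the drop count. */
  public int reportDrops() {
    int dropped = droppedCount.getAndSet(0); // atomic read-and-reset
    if (dropped > 0) {
      logger.log(Level.WARNING, "Dropped " + dropped + " spans because the queue was full.");
    }
    return dropped;
  }
}
```

Because the log statement runs at most once per export cycle, the export interval itself acts as the rate limit, and the message carries the magnitude of the backpressure rather than just its existence.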
…anProcessor

Instead of logging a throttled message per dropped span, accumulate drop counts using an AtomicInteger and report the total number of dropped spans once per export cycle in exportCurrentBatch(). This gives users visibility into the magnitude of span drops, not just that drops are occurring. The counter is incremented atomically in addSpan() (called from application threads) and read/reset with getAndSet(0) in exportCurrentBatch() (called from the worker thread).
@jkwatson
This naturally limits itself to one log per batch and gives a clear picture of how much backpressure there is. Let me know what you think!
```java
spanProcessorInstrumentation.buildQueueMetricsOnce(maxQueueSize, queue::size);
if (!queue.offer(span)) {
  spanProcessorInstrumentation.dropSpans(1);
  droppedSpanCount.incrementAndGet();
```
@jack-berg do you think it would make sense to add something to the SpanProcessorInstrumentation to get this information, rather than tracking it separately here?
/cc @anuraaga
Would have to maintain state over there, but at least it would be hidden behind a central abstraction.
An otel maximalist answer here would be:
- Record the log using the otel log API (first solving "Sketch out slf4j bridge", #7905)
- Rely on temporal correlation between logs and metrics for the user to look up the metric state at the time the log was observed
I'm not trying to boil the ocean for this though 😛
Yeah if the SpanProcessorInstrumentation was already using an async counter or such with tracking, just reading that would be fine. But it's probably better to keep that synchronous (I think?) so keeping this BSP implementation detail in BSP seems to make more sense
```java
BatchSpanProcessor.builder(mockSpanExporter)
    .setMaxQueueSize(1)
    .setMaxExportBatchSize(1)
    .setScheduleDelay(Duration.ofMillis(1))
```
There was a problem hiding this comment.
Since we don't need any exports, may as well set to a huge number to reduce potential flakiness. And maybe it makes sense then to drop more than one span
```java
await()
    .untilAsserted(
        () ->
            logs.assertContains(
```
Seems reasonable to assert the whole message, especially since it has some dynamic parts
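For readers unfamiliar with asserting on full log messages, one plain `java.util.logging` approach is a capturing `Handler`. This is an illustrative sketch, not the test utility the PR actually uses; `CapturingHandler` and `messages()` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

/**
 * Sketch: collects emitted LogRecords so a test can assert on
 * the complete message text, not just a substring.
 */
public final class CapturingHandler extends Handler {
  private final List<LogRecord> records = new ArrayList<>();

  @Override
  public synchronized void publish(LogRecord record) {
    records.add(record); // store every record for later inspection
  }

  @Override
  public void flush() {}

  @Override
  public void close() {}

  /** Returns the full message text of every captured record, in order. */
  public synchronized List<String> messages() {
    List<String> out = new ArrayList<>();
    for (LogRecord r : records) {
      out.add(r.getMessage());
    }
    return out;
  }
}
```

A test would attach the handler with `logger.addHandler(new CapturingHandler())`, trigger the drop, and assert that the captured list contains the exact expected message, including the dynamic drop count.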
Summary

When BatchSpanProcessor's internal queue fills and spans are dropped, the SDK previously only emitted internal metrics (which can be missed if no metrics pipeline is configured). This PR adds a rate-limited WARNING log so users can detect span drops via logs.

Key changes

- Add a ThrottlingLogger to BatchSpanProcessor

Fixes

Fixes #7103