-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Return percentiles metrics for batch size and event count within 1, 5, 15 minutes windows. #18503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ed to compute a flow metric.
…tored inside the RetentionWindow
…apshots of HdrHistograms
🤖 GitHub commentsJust comment with:
|
|
This pull request does not have a backport label. Could you fix it @andsel? 🙏
|
0881ea1 to
5a3bf7e
Compare
…s.<ppl_name>.batch to respect the specification (batch byte_size part)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces HdrHistogram-based metrics to capture statistical information about batch sizes and event counts over time windows. The implementation provides percentile metrics (p50, p90) for 1, 5, and 15-minute intervals, enriching the existing batch metrics available through the _node/stats API. The PR also includes a refactoring that extracts common capture and retention window functionality into reusable base classes (DatapointCapture and RetentionWindow) to support both the existing flow metrics and the new histogram metrics.
Key changes:
- Introduces
HdrHistogramFlowMetricclass that maintains histogram snapshots across multiple time windows - Refactors
FlowCaptureto extend newDatapointCapturebase class and extractsRetentionWindowfromExtendedFlowMetric - Integrates histogram recording into
QueueReadClientBatchMetricsfor batch byte size and event count
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| logstash-core/build.gradle | Adds HdrHistogram dependency (version 2.2.2) |
| tools/dependencies-report/src/main/resources/licenseMapping.csv | Maps HdrHistogram to BSD-2-Clause license |
| tools/dependencies-report/src/main/resources/notices/org.hdrhistogram!HdrHistogram-NOTICE.txt | Adds BSD-2-Clause license notice for HdrHistogram |
| logstash-core/src/main/java/org/logstash/instrument/metrics/DatapointCapture.java | New abstract base class for time-stamped data captures |
| logstash-core/src/main/java/org/logstash/instrument/metrics/RetentionWindow.java | Extracted class for managing capture retention across time windows |
| logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramCapture.java | New capture implementation for histogram snapshots |
| logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramMetricData.java | Data class holding percentile values from histograms |
| logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramFlowMetric.java | Interface definition for histogram-based flow metrics |
| logstash-core/src/main/java/org/logstash/instrument/metrics/HdrHistogramFlowMetric.java | Implementation using HdrHistogram for percentile tracking |
| logstash-core/src/main/java/org/logstash/instrument/metrics/FlowCapture.java | Refactored to extend DatapointCapture base class |
| logstash-core/src/main/java/org/logstash/instrument/metrics/ExtendedFlowMetric.java | Refactored to use extracted RetentionWindow class |
| logstash-core/src/main/java/org/logstash/instrument/metrics/FlowMetricRetentionPolicy.java | Adds samplesCount() method for memory estimation |
| logstash-core/src/main/java/org/logstash/execution/QueueReadClientBatchMetrics.java | Integrates histogram metrics for batch byte size and event count |
| logstash-core/lib/logstash/api/commands/stats.rb | Adds histogram percentile data to batch metrics response |
| logstash-core/src/test/java/org/logstash/instrument/metrics/HdrHistogramFlowMetricTest.java | Test coverage for histogram metric behavior across time windows |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Is this a problem? Probably not, because it concur to the calculation of the percentiles for a fraction, and | ||
| // giving that's a statistical approximation, doesn't really matter having exact numbers. |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammatical error in comment: "it concur" should be "it contributes". Also, "giving that's" should be "given that it's".
| // Is this a problem? Probably not, because it concur to the calculation of the percentiles for a fraction, and | |
| // giving that's a statistical approximation, doesn't really matter having exact numbers. | |
| // Is this a problem? Probably not, because it contributes to the calculation of the percentiles for a fraction, and | |
| // given that it's a statistical approximation, doesn't really matter having exact numbers. |
| // If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads reads | ||
| // currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long, | ||
| // and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different | ||
| // updatedLast and the condition will fail. | ||
| // If two threads reads exactly the same nanosecond time, also in this case only one will update the atomic long, but both |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammatical error in comment: "If two threads reads" should be "If two threads read" (without the 's').
| // If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads reads | |
| // currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long, | |
| // and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different | |
| // updatedLast and the condition will fail. | |
| // If two threads reads exactly the same nanosecond time, also in this case only one will update the atomic long, but both | |
| // If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads read | |
| // currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long, | |
| // and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different | |
| // updatedLast and the condition will fail. | |
| // If two threads read exactly the same nanosecond time, also in this case only one will update the atomic long, but both |
| for(int i = 0; i < 45; i++) { | ||
| sut.recordValue(100); | ||
| referenceHistogram.recordValue(100); | ||
| clock.advance(Duration.ofSeconds(1)); | ||
| } | ||
| for(int i = 0; i < 15; i++) { |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing whitespace after 'for' keyword. Should be 'for (int i = 0; i < 15; i++)' for consistency with Java coding standards.
| for(int i = 0; i < 45; i++) { | |
| sut.recordValue(100); | |
| referenceHistogram.recordValue(100); | |
| clock.advance(Duration.ofSeconds(1)); | |
| } | |
| for(int i = 0; i < 15; i++) { | |
| for (int i = 0; i < 45; i++) { | |
| sut.recordValue(100); | |
| referenceHistogram.recordValue(100); | |
| clock.advance(Duration.ofSeconds(1)); | |
| } | |
| for (int i = 0; i < 15; i++) { |
logstash-core/src/test/java/org/logstash/instrument/metrics/HdrHistogramFlowMetricTest.java
Outdated
Show resolved
Hide resolved
logstash-core/src/main/java/org/logstash/instrument/metrics/RetentionWindow.java
Show resolved
Hide resolved
| def reshape_histogram_percentiles_for_window(target_field, histogram_metric, window, result) | ||
| result[target_field][:p50] = {} if result[target_field][:p50].nil? | ||
| result[target_field][:p90] = {} if result[target_field][:p90].nil? | ||
|
|
||
| result[target_field][:p50][window] = histogram_metric.value[window.to_s].get50Percentile.round | ||
| result[target_field][:p90][window] = histogram_metric.value[window.to_s].get90Percentile.round | ||
| end |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential code duplication. The histogram_metric.value[window.to_s] is called twice for each window - once in the condition check and once inside the method. Consider storing the result in a local variable to avoid redundant method calls.
| for(int i = 0; i < 45; i++) { | ||
| sut.recordValue(100); | ||
| referenceHistogram.recordValue(100); | ||
| clock.advance(Duration.ofSeconds(1)); | ||
| } | ||
| for(int i = 0; i < 15; i++) { |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing whitespace after 'for' keyword. Should be 'for (int i = 0; i < 45; i++)' for consistency with Java coding standards.
| for(int i = 0; i < 45; i++) { | |
| sut.recordValue(100); | |
| referenceHistogram.recordValue(100); | |
| clock.advance(Duration.ofSeconds(1)); | |
| } | |
| for(int i = 0; i < 15; i++) { | |
| for (int i = 0; i < 45; i++) { | |
| sut.recordValue(100); | |
| referenceHistogram.recordValue(100); | |
| clock.advance(Duration.ofSeconds(1)); | |
| } | |
| for (int i = 0; i < 15; i++) { |
logstash-core/src/test/java/org/logstash/instrument/metrics/HdrHistogramFlowMetricTest.java
Outdated
Show resolved
Hide resolved
logstash-core/src/test/java/org/logstash/instrument/metrics/HdrHistogramFlowMetricTest.java
Outdated
Show resolved
Hide resolved
logstash-core/src/main/java/org/logstash/instrument/metrics/RetentionWindow.java
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
💛 Build succeeded, but was flaky
Failed CI StepsHistory
cc @andsel |
Release notes
TODO
What does this PR do?
Introduce a new user metric designed to capture histograms and expose various percentiles over defined time windows. This implementation utilizes HdrHistogram and leverages the existing window policies and capture lists established during the refactoring in #18502.
The
HdrHistogramFlowMetricclass utilizes multiple windows, each containing aHistogramSnapshotsWindow. These individualHistogramSnapshotsWindowinstances are responsible for gathering histogram snapshots at intervals determined by the window policy's defined resolution.Each window maintains an HdrHistogram's
Recorderto generate incremental histogram snapshots. When values are recorded byHistogramSnapshotsWindow.recordValue, the code checks if a new snapshot needs to be created, potentially collecting it within aRetentionWindow. This value recording process is designed for parallelism.On the read side, the
HdrHistogramFlowMetric.getValuemethod processes the collected histogram snapshots to create an aggregated histogram for each time window.Why is it important/What is the impact to the user?
This PR enrich the response of batch metrics provided in
_node/statsAPI, giving a statistical view in terms of percentiles and for various time windows regarding batch size and length.Checklist
[ ] I have made corresponding changes to the documentationfollow up PR on this.[ ] I have made corresponding change to the default configuration files (and/or docker env variables)Author's Checklist
How to test this PR locally
The test consists in changing the sampling mode
pipeline.batch.metrics.sampling_mode, executing a pipeline that accepts some traffic, monitoring withcurlthe_node/statsendpoint and use a traffic generator to feed the pipeline.Change sampling mode
Edit
config/logstash.ymlto have batch sampling mode tofull:Run Logstash with a pipeline
Run Logstash with a sample pipeline like:
bin/logstash -e "input{ tcp {port => 3333} } output{ sink{} }"Monitor the API
Execute to monitor API response, and check the various percentiles for batch metrics:
Execute a traffic load script
Use
jbangto run the Java feeding script, check https://www.jbang.dev/documentation/jbang/latest/installation.html to install.Run the script in the git https://gist.github.com/andsel/d3b372b90bd66e0db98a6acc1ac32c80 with:
Related issues
Use cases
Screenshots
Logs