Skip to content

Conversation

@andsel
Copy link
Contributor

@andsel andsel commented Dec 17, 2025

Release notes

TODO

What does this PR do?

Introduce a new user metric designed to capture histograms and expose various percentiles over defined time windows. This implementation utilizes HdrHistogram and leverages the existing window policies and capture lists established during the refactoring in #18502.
The HdrHistogramFlowMetric class utilizes multiple windows, each containing a HistogramSnapshotsWindow. These individual HistogramSnapshotsWindow instances are responsible for gathering histogram snapshots at intervals determined by the window policy's defined resolution.

Each window maintains an HdrHistogram's Recorder to generate incremental histogram snapshots. When values are recorded by HistogramSnapshotsWindow.recordValue, the code checks if a new snapshot needs to be created, potentially collecting it within a RetentionWindow. This value recording process is designed for parallelism.

On the read side, the HdrHistogramFlowMetric.getValue method processes the collected histogram snapshots to create an aggregated histogram for each time window.

Why is it important/What is the impact to the user?

This PR enrich the response of batch metrics provided in _node/stats API, giving a statistical view in terms of percentiles and for various time windows regarding batch size and length.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation follow up PR on this.
  • [ ] I have made corresponding change to the default configuration files (and/or docker env variables)
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • update licenses
  • create histogram flow also for batch event count
  • cover with test, in particular check 1m window after a minute with no events should go down to 0
  • reshape the response in logstash-core/lib/logstash/api/commands/stats.rb

How to test this PR locally

The test consists in changing the sampling mode pipeline.batch.metrics.sampling_mode, executing a pipeline that accepts some traffic, monitoring with curl the _node/stats endpoint and use a traffic generator to feed the pipeline.

Change sampling mode

Edit config/logstash.yml to have batch sampling mode to full:

pipeline.batch.metrics.sampling_mode: full

Run Logstash with a pipeline

Run Logstash with a sample pipeline like:

bin/logstash -e "input{ tcp {port => 3333} } output{ sink{} }"

Monitor the API

Execute to monitor API response, and check the various percentiles for batch metrics:

while true; do curl http://localhost:9600/_node/stats | jq .pipelines.main.batch; sleep 3; clear; done

Execute a traffic load script

Use jbang to run the Java feeding script, check https://www.jbang.dev/documentation/jbang/latest/installation.html to install.
Run the script in the git https://gist.github.com/andsel/d3b372b90bd66e0db98a6acc1ac32c80 with:

./TrafficSimulator.java -p 3333

Related issues

Use cases

Screenshots

Logs

@github-actions
Copy link
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)
  • /run exhaustive tests : Run the exhaustive tests Buildkite pipeline.

@mergify
Copy link
Contributor

mergify bot commented Dec 17, 2025

This pull request does not have a backport label. Could you fix it @andsel? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

@andsel andsel self-assigned this Dec 19, 2025
@andsel andsel force-pushed the feature/create_histogram_flowmetrics branch from 0881ea1 to 5a3bf7e Compare December 19, 2025 13:17
@andsel andsel changed the title Feature/create histogram flowmetrics Present percentiles of batch size and event count within 1, 5, 15 minutes windows. Dec 22, 2025
@andsel andsel changed the title Present percentiles of batch size and event count within 1, 5, 15 minutes windows. Return percentiles metrics for batch size and event count within 1, 5, 15 minutes windows. Dec 22, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces HdrHistogram-based metrics to capture statistical information about batch sizes and event counts over time windows. The implementation provides percentile metrics (p50, p90) for 1, 5, and 15-minute intervals, enriching the existing batch metrics available through the _node/stats API. The PR also includes a refactoring that extracts common capture and retention window functionality into reusable base classes (DatapointCapture and RetentionWindow) to support both the existing flow metrics and the new histogram metrics.

Key changes:

  • Introduces HdrHistogramFlowMetric class that maintains histogram snapshots across multiple time windows
  • Refactors FlowCapture to extend new DatapointCapture base class and extracts RetentionWindow from ExtendedFlowMetric
  • Integrates histogram recording into QueueReadClientBatchMetrics for batch byte size and event count

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
logstash-core/build.gradle Adds HdrHistogram dependency (version 2.2.2)
tools/dependencies-report/src/main/resources/licenseMapping.csv Maps HdrHistogram to BSD-2-Clause license
tools/dependencies-report/src/main/resources/notices/org.hdrhistogram!HdrHistogram-NOTICE.txt Adds BSD-2-Clause license notice for HdrHistogram
logstash-core/src/main/java/org/logstash/instrument/metrics/DatapointCapture.java New abstract base class for time-stamped data captures
logstash-core/src/main/java/org/logstash/instrument/metrics/RetentionWindow.java Extracted class for managing capture retention across time windows
logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramCapture.java New capture implementation for histogram snapshots
logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramMetricData.java Data class holding percentile values from histograms
logstash-core/src/main/java/org/logstash/instrument/metrics/HistogramFlowMetric.java Interface definition for histogram-based flow metrics
logstash-core/src/main/java/org/logstash/instrument/metrics/HdrHistogramFlowMetric.java Implementation using HdrHistogram for percentile tracking
logstash-core/src/main/java/org/logstash/instrument/metrics/FlowCapture.java Refactored to extend DatapointCapture base class
logstash-core/src/main/java/org/logstash/instrument/metrics/ExtendedFlowMetric.java Refactored to use extracted RetentionWindow class
logstash-core/src/main/java/org/logstash/instrument/metrics/FlowMetricRetentionPolicy.java Adds samplesCount() method for memory estimation
logstash-core/src/main/java/org/logstash/execution/QueueReadClientBatchMetrics.java Integrates histogram metrics for batch byte size and event count
logstash-core/lib/logstash/api/commands/stats.rb Adds histogram percentile data to batch metrics response
logstash-core/src/test/java/org/logstash/instrument/metrics/HdrHistogramFlowMetricTest.java Test coverage for histogram metric behavior across time windows

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 89 to 90
// Is this a problem? Probably not, because it concur to the calculation of the percentiles for a fraction, and
// giving that's a statistical approximation, doesn't really matter having exact numbers.
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error in comment: "it concur" should be "it contributes". Also, "giving that's" should be "given that it's".

Suggested change
// Is this a problem? Probably not, because it concur to the calculation of the percentiles for a fraction, and
// giving that's a statistical approximation, doesn't really matter having exact numbers.
// Is this a problem? Probably not, because it contributes to the calculation of the percentiles for a fraction, and
// given that it's a statistical approximation, doesn't really matter having exact numbers.

Copilot uses AI. Check for mistakes.
Comment on lines 81 to 85
// If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads reads
// currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long,
// and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different
// updatedLast and the condition will fail.
// If two threads reads exactly the same nanosecond time, also in this case only one will update the atomic long, but both
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammatical error in comment: "If two threads reads" should be "If two threads read" (without the 's').

Suggested change
// If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads reads
// currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long,
// and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different
// updatedLast and the condition will fail.
// If two threads reads exactly the same nanosecond time, also in this case only one will update the atomic long, but both
// If two threads race to update lastRecordTimeNanos, only one updates the variable. If two threads read
// currentTimeNanos that are different but really close to the other, then only one succeed in updating the atomic long,
// and will satisfy updatedLast == currentTimeNanos, so will create the snapshot. The other will read a different
// updatedLast and the condition will fail.
// If two threads read exactly the same nanosecond time, also in this case only one will update the atomic long, but both

Copilot uses AI. Check for mistakes.
Comment on lines 56 to 61
for(int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for(int i = 0; i < 15; i++) {
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing whitespace after 'for' keyword. Should be 'for (int i = 0; i < 15; i++)' for consistency with Java coding standards.

Suggested change
for(int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for(int i = 0; i < 15; i++) {
for (int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for (int i = 0; i < 15; i++) {

Copilot uses AI. Check for mistakes.
Comment on lines 226 to 232
def reshape_histogram_percentiles_for_window(target_field, histogram_metric, window, result)
result[target_field][:p50] = {} if result[target_field][:p50].nil?
result[target_field][:p90] = {} if result[target_field][:p90].nil?

result[target_field][:p50][window] = histogram_metric.value[window.to_s].get50Percentile.round
result[target_field][:p90][window] = histogram_metric.value[window.to_s].get90Percentile.round
end
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential code duplication. The histogram_metric.value[window.to_s] is called twice for each window - once in the condition check and once inside the method. Consider storing the result in a local variable to avoid redundant method calls.

Copilot uses AI. Check for mistakes.
Comment on lines 56 to 61
for(int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for(int i = 0; i < 15; i++) {
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing whitespace after 'for' keyword. Should be 'for (int i = 0; i < 45; i++)' for consistency with Java coding standards.

Suggested change
for(int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for(int i = 0; i < 15; i++) {
for (int i = 0; i < 45; i++) {
sut.recordValue(100);
referenceHistogram.recordValue(100);
clock.advance(Duration.ofSeconds(1));
}
for (int i = 0; i < 15; i++) {

Copilot uses AI. Check for mistakes.
andsel and others added 2 commits December 23, 2025 15:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@elasticmachine
Copy link

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @andsel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants