
Conversation

@rgacogne (Member)

Short description

The existing metric only keeps the latency of the latest successful health-check query, which is not very useful for tracking latency spikes since it changes every second.

Checklist

I have:

  • read the CONTRIBUTING.md document
  • read and accepted the Developer Certificate of Origin document, including the AI Policy, and added a "Signed-off-by" to my commits
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
@coveralls

Pull Request Test Coverage Report for Build 20429409923

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 74 of 79 (93.67%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on ddist-health-check-latency-bucket at 73.346%

Changes Missing Coverage            | Covered Lines | Changed/Added Lines | %
pdns/dnsdistdist/dnsdist-metrics.hh | 20            | 22                  | 90.91%
pdns/dnsdistdist/dnsdist.cc         | 9             | 12                  | 75.0%

Totals Coverage Status
Change from base Build 20365946580: 73.3%
Covered Lines: 128869
Relevant Lines: 164911

💛 - Coveralls

@omoerbeek (Member)

Maybe it's an idea to start using histogram.hh at some point? It currently lives in recursordist, but it's generic code.

@omoerbeek (Member) left a comment


But see comment

@rgacogne (Member, Author)

rgacogne commented Jan 6, 2026

Maybe it's an idea to start using histogram.hh at some point? It currently lives in recursordist, but it's generic code.

I have now taken a good look at it, thanks. I am not really convinced it would make sense to use it, given that our boundaries are fixed and always the same: the cleverness needed to handle custom boundaries would add complexity to the handling of our metrics, and we can't reuse the logic present in the recursor because our metrics are very different. It'll be different once we have a histogram with different boundaries, of course.
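For reference, a minimal sketch of the fixed-boundary approach described above. The struct name, the record() helper and the exact set of buckets are illustrative assumptions, not the actual dnsdist code; only the latency1_10/latency10_50/latency50_100/latency100_1000 field names come from the diff quoted below.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: dnsdist uses its own stat_t type and field layout.
// With fixed, hard-coded boundaries, recording a sample is a simple
// cascade of comparisons and no per-boundary configuration is needed.
struct HealthCheckLatencyBuckets
{
  using stat_t = std::atomic<uint64_t>;
  stat_t latency0_1{0};      // < 1 ms (assumed bucket)
  stat_t latency1_10{0};     // 1 to 10 ms
  stat_t latency10_50{0};    // 10 to 50 ms
  stat_t latency50_100{0};   // 50 to 100 ms
  stat_t latency100_1000{0}; // 100 to 1000 ms
  stat_t latencySlow{0};     // >= 1000 ms (assumed bucket)

  void record(double latencyMsec)
  {
    if (latencyMsec < 1.0) { ++latency0_1; }
    else if (latencyMsec < 10.0) { ++latency1_10; }
    else if (latencyMsec < 50.0) { ++latency10_50; }
    else if (latencyMsec < 100.0) { ++latency50_100; }
    else if (latencyMsec < 1000.0) { ++latency100_1000; }
    else { ++latencySlow; }
  }
};
```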

@pieterlexis (Contributor) left a comment


Looks good. One issue and one question.

Comment on lines +681 to +682
backend_latency_amount += state->d_healthCheckLatencyHisto.latency10_50;
output << statesbase << "healthchecklatency_histo_bucket" << latency_label_prefix << ",le=\"50\"} " << backend_latency_amount << "\n";
Contributor


These lines are duplicated from the two lines above, i.e. the latency10_50 stat is output and counted twice.

Member Author


Good catch; perhaps it means that more generic logic would actually be useful, so I'll have a look. And I guess a more specific regression test would help. I'm surprised promtool is not complaining.
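As an illustration of what such more generic logic could look like (not the actual fix): a single helper walks an ordered list of (upper bound, count) pairs so each bucket is accumulated and printed exactly once. The writeLatencyBuckets name, its parameters and the bucket list are hypothetical; only the healthchecklatency_histo_bucket metric name and the le label come from the diff above.

```cpp
#include <cstdint>
#include <ostream>
#include <string>
#include <utility>
#include <vector>

// Emit cumulative Prometheus-style buckets: visiting each (upper bound,
// count) pair exactly once makes the "counted twice" bug impossible.
void writeLatencyBuckets(std::ostream& output,
                         const std::string& statesbase,
                         const std::string& latencyLabelPrefix,
                         const std::vector<std::pair<std::string, uint64_t>>& buckets)
{
  uint64_t cumulative = 0;
  for (const auto& [upperBound, count] : buckets) {
    cumulative += count;
    output << statesbase << "healthchecklatency_histo_bucket" << latencyLabelPrefix
           << ",le=\"" << upperBound << "\"} " << cumulative << "\n";
  }
}
```

Called with upper bounds such as "1", "10", "50", "100", "1000" and a final "+Inf" entry, this keeps the emitted counts non-decreasing, as a well-formed Prometheus histogram requires.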

stat_t latency1_10{0};
stat_t latency10_50{0};
stat_t latency50_100{0};
stat_t latency100_1000{0};
Contributor


I wonder if we should have

latency100_500
latency500_1000

instead of a single 900 ms bucket. This also "mirrors" the 10_50 and 50_100 buckets better.

Member Author


Right, but then we can no longer share the same code with the existing histogram without breaking anyone relying on the existing buckets. Is it worth it?

Contributor


Changed buckets should automatically work in most metrics apps. But we can test/ask around.

Member Author


We also export the latency values used for the existing histogram as individual, regular metrics via carbon, API, SNMP (so, at least a MIB update) and dumpStats, though.

Contributor


Oh, then never mind. Perhaps this is something for another time.
