
Conversation

@rgacogne (Member)

Short description

The existing metric only keeps the latency of the latest successful health-check query, which is not very useful for tracking latency spikes since it changes every second.

Checklist

I have:

  • read the CONTRIBUTING.md document
  • read and accepted the Developer Certificate of Origin document, including the AI Policy, and added a "Signed-off-by" to my commits
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
@coveralls

Pull Request Test Coverage Report for Build 20429409923

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 74 of 79 (93.67%) changed or added relevant lines in 8 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on ddist-health-check-latency-bucket at 73.346%

Changes Missing Coverage            | Covered Lines | Changed/Added Lines | %
pdns/dnsdistdist/dnsdist-metrics.hh | 20            | 22                  | 90.91%
pdns/dnsdistdist/dnsdist.cc         | 9             | 12                  | 75.0%

Totals Coverage Status
Change from base Build 20365946580: 73.3%
Covered Lines: 128869
Relevant Lines: 164911

💛 - Coveralls

@omoerbeek (Member)

Maybe it's an idea to start using histogram.hh at some point? It currently lives in recursordist, but it's generic code.

@omoerbeek (Member) left a comment


But see comment

@rgacogne (Member, Author)

rgacogne commented Jan 6, 2026

Maybe it's an idea to start using histogram.hh at some point? It currently lives in recursordist, but it's generic code.

I have now taken a good look at it, thanks. I am not really convinced it would make sense to use it, given that our boundaries are fixed and always the same: the cleverness needed to handle custom boundaries would add complexity to the handling of our metrics, and we can't reuse the logic present in the recursor because our metrics are very different. It'll be different once we have a histogram with different boundaries, of course.
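For reference, a minimal sketch of the fixed-boundary approach described above. The struct name, the record() helper and the exact set of buckets are illustrative assumptions, not the actual dnsdist code; only the latency1_10/latency10_50/latency50_100/latency100_1000 field names come from the diff quoted below.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: dnsdist uses its own stat_t type and field layout.
// With fixed, hard-coded boundaries, recording a sample is a simple
// cascade of comparisons and no per-boundary configuration is needed.
struct HealthCheckLatencyBuckets
{
  using stat_t = std::atomic<uint64_t>;
  stat_t latency0_1{0};      // < 1 ms (assumed bucket)
  stat_t latency1_10{0};     // 1 to 10 ms
  stat_t latency10_50{0};    // 10 to 50 ms
  stat_t latency50_100{0};   // 50 to 100 ms
  stat_t latency100_1000{0}; // 100 to 1000 ms
  stat_t latencySlow{0};     // >= 1000 ms (assumed bucket)

  void record(double latencyMsec)
  {
    if (latencyMsec < 1.0) { ++latency0_1; }
    else if (latencyMsec < 10.0) { ++latency1_10; }
    else if (latencyMsec < 50.0) { ++latency10_50; }
    else if (latencyMsec < 100.0) { ++latency50_100; }
    else if (latencyMsec < 1000.0) { ++latency100_1000; }
    else { ++latencySlow; }
  }
};
```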

@pieterlexis (Contributor) left a comment


Looks good. One issue and one question.

Comment on lines +681 to +682
backend_latency_amount += state->d_healthCheckLatencyHisto.latency10_50;
output << statesbase << "healthchecklatency_histo_bucket" << latency_label_prefix << ",le=\"50\"} " << backend_latency_amount << "\n";
Contributor


These lines are duplicated from the two lines above, i.e. the latency10_50 stat is output and counted twice.

Member Author


Good catch; perhaps it means that more generic logic would actually be useful, so I'll have a look. And I guess a more specific regression test would help. I'm surprised promtool is not complaining.
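As an illustration of what such more generic logic could look like (not the actual fix): a single helper walks an ordered list of (upper bound, count) pairs so each bucket is accumulated and printed exactly once. The writeLatencyBuckets name, its parameters and the bucket list are hypothetical; only the healthchecklatency_histo_bucket metric name and the le label come from the diff above.

```cpp
#include <cstdint>
#include <ostream>
#include <string>
#include <utility>
#include <vector>

// Emit cumulative Prometheus-style buckets: visiting each (upper bound,
// count) pair exactly once makes the "counted twice" bug impossible.
void writeLatencyBuckets(std::ostream& output,
                         const std::string& statesbase,
                         const std::string& latencyLabelPrefix,
                         const std::vector<std::pair<std::string, uint64_t>>& buckets)
{
  uint64_t cumulative = 0;
  for (const auto& [upperBound, count] : buckets) {
    cumulative += count;
    output << statesbase << "healthchecklatency_histo_bucket" << latencyLabelPrefix
           << ",le=\"" << upperBound << "\"} " << cumulative << "\n";
  }
}
```

Called with upper bounds such as "1", "10", "50", "100", "1000" and a final "+Inf" entry, this keeps the emitted counts non-decreasing, as a well-formed Prometheus histogram requires.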

stat_t latency1_10{0};
stat_t latency10_50{0};
stat_t latency50_100{0};
stat_t latency100_1000{0};
Contributor


I wonder if we should have

latency100_500
latency500_1000

instead of a single 900 ms bucket. This also "mirrors" the 10_50 and 50_100 buckets better.

Member Author


Right, but then we can no longer share the same code with the existing histogram without breaking anyone relying on the existing buckets. Is it worth it?

Contributor


Changed buckets should automatically work in most metrics apps. But we can test/ask around.

Member Author


We also export the latency values used for the existing histogram as individual, regular metrics via carbon, API, SNMP (so, at least a MIB update) and dumpStats, though.

Contributor


Oh, then never mind. Perhaps this is something for another time.
