dnsdist: Add an histogram of health-check latencies for backends #16668
base: master
Conversation
Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
Pull Request Test Coverage Report for Build 20429409923

Warning: this coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, which means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
Maybe it's an idea to start using …
omoerbeek left a comment:

But see comment.
I have now taken a good look at it, thanks. I am not really convinced it would make sense to use it, given that our boundaries are fixed and always the same: the cleverness needed to handle custom boundaries would add complexity to the handling of our metrics, and we cannot reuse the logic present in the recursor because our metrics are very different. It will be different once we have a histogram with different boundaries, of course.
pieterlexis left a comment:

Looks good. One issue and one question.
    backend_latency_amount += state->d_healthCheckLatencyHisto.latency10_50;
    output << statesbase << "healthchecklatency_histo_bucket" << latency_label_prefix << ",le=\"50\"} " << backend_latency_amount << "\n";
These lines are duplicated from the two lines above, i.e. the latency10_50 stat is output and counted twice.
Good catch, perhaps it means that more generic logic would actually be useful, I'll have a look. And I guess a more specific regression test would help. I'm surprised promtool is not complaining.
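One way such generic logic could look: drive the export from a single table of buckets and accumulate in a loop, so each counter is added exactly once and a doubled line becomes impossible by construction. This is only a sketch; the struct name, the extra latency0_1/latencySlow fields, and the exportBuckets helper are illustrative assumptions, not the actual dnsdist code:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-boundary bucket layout, mirroring the fields
// visible in the diff (boundaries in milliseconds).
struct HealthCheckLatencyHisto
{
  uint64_t latency0_1{0};
  uint64_t latency1_10{0};
  uint64_t latency10_50{0};
  uint64_t latency50_100{0};
  uint64_t latency100_1000{0};
  uint64_t latencySlow{0};
};

// Emit Prometheus-style cumulative buckets from one table: the running
// sum and the 'le' label always stay in sync, and no bucket can be
// counted twice.
std::string exportBuckets(const std::string& prefix, const HealthCheckLatencyHisto& histo)
{
  const std::vector<std::pair<std::string, uint64_t>> buckets = {
    {"1", histo.latency0_1},
    {"10", histo.latency1_10},
    {"50", histo.latency10_50},
    {"100", histo.latency50_100},
    {"1000", histo.latency100_1000},
    {"+Inf", histo.latencySlow},
  };
  std::ostringstream output;
  uint64_t cumulative = 0;
  for (const auto& [le, count] : buckets) {
    cumulative += count;
    output << prefix << "healthchecklatency_histo_bucket{le=\"" << le << "\"} " << cumulative << "\n";
  }
  return output.str();
}
```

A regression test would then only need to check that each exported bucket value is the sum of all smaller buckets, and that the "+Inf" bucket equals the total count.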
    stat_t latency1_10{0};
    stat_t latency10_50{0};
    stat_t latency50_100{0};
    stat_t latency100_1000{0};
I wonder if we should have

    latency100_500
    latency500_1000

instead of a single 900 ms bucket. This also "mirrors" the 10_50 and 50_100 buckets better.
Right, but then we can no longer share the same code with the existing histogram without breaking anyone relying on the existing buckets. Is it worth it?
Changed buckets should automatically work in most metrics apps. But we can test/ask around.
We also export the latency values used for the existing histogram as individual, regular metrics via carbon, API, SNMP (so, at least a MIB update) and dumpStats, though.
Oh, then never mind. Perhaps this is something for another time.
Short description
The existing metric only keeps the latency of the latest successful health-check query, which is not very useful for tracking latency spikes because it changes every second.
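The idea can be sketched as follows: instead of overwriting a single "last latency" value on every health check, count each measured round-trip time into a fixed bucket. The struct name, the record() helper, and the exact boundaries (in milliseconds) are illustrative assumptions based on the bucket fields shown in the diff, not the actual dnsdist implementation:

```cpp
#include <cstdint>

// Illustrative fixed-boundary histogram of health-check latencies.
// Each successful health check adds one to the bucket matching its
// round-trip time, so spikes remain visible after the fact.
struct HealthCheckLatencyHisto
{
  uint64_t latency0_1{0};
  uint64_t latency1_10{0};
  uint64_t latency10_50{0};
  uint64_t latency50_100{0};
  uint64_t latency100_1000{0};
  uint64_t latencySlow{0};

  void record(uint64_t milliseconds)
  {
    if (milliseconds < 1) { ++latency0_1; }
    else if (milliseconds < 10) { ++latency1_10; }
    else if (milliseconds < 50) { ++latency10_50; }
    else if (milliseconds < 100) { ++latency50_100; }
    else if (milliseconds < 1000) { ++latency100_1000; }
    else { ++latencySlow; }
  }
};
```

With counters like these, a spike that lasts a few seconds shows up as a bump in the high buckets even if the most recent check happens to be fast again.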
Checklist
I have: