Coordinate cumulative histogram reset timestamps across bucket series #318
dashpole wants to merge 3 commits into
Code Review
This pull request introduces support for handling dynamically appearing histogram buckets (such as zero-count buckets omitted by Kong) by tracking authoritative reset timestamps in a new histogramResets map and utilizing a specialized getResetAdjustedBucket method. A comprehensive test suite has been added to verify these changes. Feedback on the changes highlights a potential memory leak, as the newly introduced histogramResets map is never cleared or garbage-collected, and suggests updating the clear and garbageCollect methods to clean up inactive hashes.
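One possible shape for the cleanup suggested above is sketched below. This is a simplified model, not the code in this PR: the resetCache type, the lastSeen bookkeeping map, the uint64 hash key, and the method signatures are all assumptions made for illustration.

```go
package main

import "fmt"

// resetCache is a hypothetical, simplified stand-in for seriesCache.
type resetCache struct {
	histogramResets map[uint64]int64 // histogram hash -> reset timestamp (ms)
	lastSeen        map[uint64]int64 // histogram hash -> last scrape time (ms); assumed bookkeeping
}

func newResetCache() *resetCache {
	return &resetCache{
		histogramResets: map[uint64]int64{},
		lastSeen:        map[uint64]int64{},
	}
}

// observe records the reset timestamp for a histogram and when it was last scraped.
func (c *resetCache) observe(hash uint64, resetT, now int64) {
	c.histogramResets[hash] = resetT
	c.lastSeen[hash] = now
}

// garbageCollect drops hashes not seen for more than maxAge milliseconds
// and returns how many were removed.
func (c *resetCache) garbageCollect(now, maxAge int64) int {
	removed := 0
	for hash, seen := range c.lastSeen {
		if now-seen > maxAge {
			delete(c.histogramResets, hash)
			delete(c.lastSeen, hash)
			removed++
		}
	}
	return removed
}

// clear drops all tracked reset timestamps.
func (c *resetCache) clear() {
	c.histogramResets = map[uint64]int64{}
	c.lastSeen = map[uint64]int64{}
}

func main() {
	c := newResetCache()
	c.observe(1, 1000000, 1000000) // stale histogram
	c.observe(2, 1000000, 1600000) // recently scraped histogram
	removed := c.garbageCollect(1700000, 300000)
	fmt.Println(removed, len(c.histogramResets))
}
```

Keeping a separate last-seen map is only one option; if the existing series entries already carry a last-seen time, the reset map could be pruned alongside them instead.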
    {Ref: 3, T: 1000000, V: 10},
    {Ref: 4, T: 1000000, V: 500},
    },
    { // Scrape 2 (T=1030s): Mid-scrape request increments _count=21 while buckets reflect 20
I checked this manually, and it does not trigger start time (ST) ingestion problems in Monarch.
    {Ref: 4, T: 1060000, V: 1100},
    },
    },
    wantSkipped: []bool{true, false, false},
It is true that we skip the sample in this case, but that is probably for the better; it does no harm. The proposed fix might work, though (it adds the sample and ensures the right adjustment), if we want the added complexity.
I still cannot reproduce out-of-order ST (checked manually).
(me actually writing) I tried for a while longer to reproduce the actual issue. It seems like this CAN be the cause of gaps in histogram data, but it can't be the source of start time ordering errors. Those seem possible only when scraping the same metrics from different ports, or something like that.
Overview & Problem Statement
Enterprise Kong API gateway users on GKE reported frequent metric write rejections in Google Managed Prometheus (GMP) / Cloud Monitoring:
Points must be written in order. One or more of the points specified had an older start time than the most recent point.
Root cause analysis isolated these sparse failures specifically to cumulative histogram metrics (e.g. kong_upstream_latency_ms).
The Bug in Kong & Manifestation in Prometheus Scrapes
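The non-atomic read described in this section can be illustrated with a small simulation. This is a sketch under stated assumptions, not Kong's Lua code: the histogram and scrape types, the single le="+Inf" bucket, and the yield callback that stands in for coroutine.yield() are all hypothetical.

```go
package main

import "fmt"

// histogram is a hypothetical stand-in for the counters kept in shared memory.
type histogram struct {
	bucketInf float64 // cumulative count in the le="+Inf" bucket
	count     float64
	sum       float64
}

// observe records one request with the given latency.
func (h *histogram) observe(latency float64) {
	h.bucketInf++
	h.count++
	h.sum += latency
}

// scrape holds the values one scrape reported.
type scrape struct {
	bucketInf, count, sum float64
}

// scrapeNonAtomic reads _bucket, then _count, then _sum. It calls yield before
// each read to model the point where other requests may run.
func scrapeNonAtomic(h *histogram, yield func(step int)) scrape {
	var s scrape
	yield(0)
	s.bucketInf = h.bucketInf
	yield(1)
	s.count = h.count
	yield(2)
	s.sum = h.sum
	return s
}

func main() {
	h := &histogram{}
	for i := 0; i < 20; i++ {
		h.observe(10)
	}
	s := scrapeNonAtomic(h, func(step int) {
		if step == 1 {
			h.observe(10) // a request lands after the bucket was already read
		}
	})
	// The reported count is ahead of the reported bucket total.
	fmt.Printf("bucket(+Inf)=%v count=%v sum=%v\n", s.bucketInf, s.count, s.sum)
}
```

The values mirror the Scrape 2 case in the test data, where _count reads 21 while the buckets still reflect 20.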
Analysis of Kong's shared memory dictionary (
ngx.shared.dict) and coroutine yielding (kong/plugins/prometheus/prometheus.lua) revealed two scrape anomalies:_bucket): Kong omits zero-count buckets from memory to conserve storage. When lower latency observations occur on a subsequent scrape, new bucket boundary series (e.g.le="25") dynamically appear in the scrape output mid-stream.metric_data(), keys are retrieved alphabetically (_bucketbefore_countbefore_sum) withcoroutine.yield()called before each fetch. Incoming requests mid-scrape increment_countand_sumafter_buckethas already been read.GMP Exporter Failure Logic
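As background for the two failure paths in this section, the per-series reset adjustment can be modeled roughly as follows. This is a simplified sketch, not the google/export implementation: the entry type, the adjust signature, and the choice to set resetValue to 0 on a detected reset are assumptions; only the !hasReset skip and resetTimestamp = t - 1 come from the description.

```go
package main

import "fmt"

// entry is a hypothetical per-series baseline.
type entry struct {
	hasReset       bool
	resetValue     float64
	resetTimestamp int64
	lastValue      float64
}

// adjust returns the reset timestamp, the value relative to the baseline, and
// whether the sample should be emitted.
//   - First observation: the baseline is recorded and the sample is skipped.
//   - Decrease (v < lastValue): treated as a counter reset at t - 1.
func (e *entry) adjust(t int64, v float64) (resetT int64, adjusted float64, ok bool) {
	if !e.hasReset {
		e.hasReset = true
		e.resetTimestamp = t
		e.resetValue = v
		e.lastValue = v
		return t, 0, false
	}
	if v < e.lastValue {
		e.resetValue = 0 // assumption: the baseline restarts from zero
		e.resetTimestamp = t - 1
	}
	e.lastValue = v
	return e.resetTimestamp, v - e.resetValue, true
}

func main() {
	e := &entry{}
	samples := []struct {
		t int64
		v float64
	}{
		{1000000, 10}, // first sight: skipped
		{1030000, 21}, // normal increase
		{1060000, 5},  // decrease: reset timestamp moves to t - 1
	}
	for _, s := range samples {
		rt, adj, ok := e.adjust(s.t, s.v)
		fmt.Printf("t=%d resetT=%d adjusted=%v emitted=%t\n", s.t, rt, adj, ok)
	}
}
```

Because each series keeps its own entry, a bucket that first appears on a later scrape starts its baseline later than _count and _sum, which is how the series of one histogram can end up with different reset timestamps.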
In
google/export:getResetAdjustedtreats the series reference as uninitialized (!hasReset) and marksdist.skip = true. The entire distribution sample is skipped on Scrape N, while_countand_sumadvance their baseline tracking._countto decrease relative to the prior scrape (v < lastValue),getResetAdjustedresetsresetTimestamp = t - 1. When_countsubsequently recovers on Scrape N+1 while_sumlags, Monarch rejects the misaligned reset and start timestamps.Solution
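The approach in this section can be sketched as follows. This is an illustration under assumptions rather than the patched code: the cache and entry types, the uint64 histogram hash key, and the adjustBucket signature are hypothetical; only the names histogramResets and the rt < t inheritance rule with resetValue = 0 come from the description.

```go
package main

import "fmt"

// entry is a hypothetical per-series baseline.
type entry struct {
	hasReset       bool
	resetValue     float64
	resetTimestamp int64
	lastValue      float64
}

// cache is a simplified stand-in for seriesCache.
type cache struct {
	entries         map[uint64]*entry // series ref -> baseline
	histogramResets map[uint64]int64  // histogram hash -> reset timestamp from _count
}

func newCache() *cache {
	return &cache{entries: map[uint64]*entry{}, histogramResets: map[uint64]int64{}}
}

// adjustBucket adjusts one _bucket sample. A bucket seen for the first time
// inherits the histogram's established reset timestamp when it predates t.
func (c *cache) adjustBucket(ref, histHash uint64, t int64, v float64) (int64, float64, bool) {
	e, found := c.entries[ref]
	if !found {
		e = &entry{}
		c.entries[ref] = e
	}
	if !e.hasReset {
		e.hasReset = true
		e.lastValue = v
		if rt, ok := c.histogramResets[histHash]; ok && rt < t {
			e.resetTimestamp = rt
			e.resetValue = 0 // the bucket was implicitly zero since the reset
			return rt, v, true
		}
		// No earlier reset is known: fall back to skipping the first sample.
		e.resetTimestamp = t
		e.resetValue = v
		return t, 0, false
	}
	if v < e.lastValue {
		e.resetValue = 0
		e.resetTimestamp = t - 1
	}
	e.lastValue = v
	return e.resetTimestamp, v - e.resetValue, true
}

func main() {
	c := newCache()
	c.histogramResets[42] = 1000000 // established by _count on Scrape 1
	rt, adj, ok := c.adjustBucket(5, 42, 1030000, 3)
	fmt.Println(rt, adj, ok) // the new bucket is emitted with the inherited reset
}
```

Starting the inherited baseline at zero reflects the fact that an omitted bucket had no observations, so its full cumulative value counts from the established reset time.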
Patched transform.go and series_cache.go:
- seriesCache tracks established cumulative histogram reset timestamps from _count series in histogramResets.
- getResetAdjustedBucket: When a bucket boundary (_bucket) arrives without prior tracking (!hasReset), if an authoritative reset timestamp was already established on an earlier scrape (rt < t), the bucket inherits the established reset timestamp and initializes its baseline to resetValue = 0.
Verifiable Investigation & Reproduction Artifacts
For reviewers interested in verifying the full Kong simulation, memory traces, and reproduction test harness, the complete experiment is documented and archived on branch kong-experiment-artifacts.
Created by Gemini