Coordinate cumulative histogram reset timestamps across bucket series by dashpole · Pull Request #318 · GoogleCloudPlatform/prometheus

dashpole · 2026-06-24T21:18:39Z

Overview & Problem Statement

Enterprise Kong API gateway users on GKE reported frequent metric write rejections in Google Managed Prometheus (GMP) / Cloud Monitoring:
Points must be written in order. One or more of the points specified had an older start time than the most recent point.

Root-cause analysis traced these sporadic failures specifically to cumulative histogram metrics (e.g. kong_upstream_latency_ms).

The Bug in Kong & Manifestation in Prometheus Scrapes

Analysis of Kong's shared memory dictionary (ngx.shared.dict) and coroutine yielding (kong/plugins/prometheus/prometheus.lua) revealed two scrape anomalies:

Omitted Zero Buckets (_bucket): Kong omits zero-count buckets from memory to conserve storage. When lower latency observations occur on a subsequent scrape, new bucket boundary series (e.g. le="25") dynamically appear in the scrape output mid-stream.
Mid-Scrape Yield Desynchronization: In metric_data(), keys are retrieved alphabetically (_bucket before _count before _sum) with coroutine.yield() called before each fetch. Incoming requests mid-scrape increment _count and _sum after _bucket has already been read.
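The second anomaly can be illustrated with a small Go sketch. This is not Kong's Lua code: the store type, the latency_* key names, and the yield callback are assumptions made for illustration. It only shows how reading keys in sorted order, with a yield before each read, can return buckets and a count that disagree.

```go
package main

import (
	"fmt"
	"sort"
)

// store stands in for a flat shared dictionary keyed by series name (hypothetical).
type store map[string]float64

// observe records one latency observation into a two-bucket histogram.
func (s store) observe(ms float64) {
	if ms <= 50 {
		s[`latency_bucket{le="50"}`]++
	}
	s[`latency_bucket{le="+Inf"}`]++
	s["latency_count"]++
	s["latency_sum"] += ms
}

// scrape reads keys in sorted order (_bucket, then _count, then _sum) and
// calls yield before each read, imitating a coroutine yield that lets
// other requests run in between.
func scrape(s store, yield func(key string)) map[string]float64 {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make(map[string]float64, len(keys))
	for _, k := range keys {
		yield(k)
		out[k] = s[k]
	}
	return out
}

func main() {
	s := store{}
	for i := 0; i < 20; i++ {
		s.observe(10)
	}
	snap := scrape(s, func(key string) {
		if key == "latency_count" {
			s.observe(10) // a request lands after the buckets were already read
		}
	})
	fmt.Println(snap[`latency_bucket{le="+Inf"}`], snap["latency_count"])
}
```

In this run the snapshot reports 20 observations in the +Inf bucket but a count of 21, which is the same shape of mismatch described for Scrape 2 in the PR's test case.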

GMP Exporter Failure Logic

In google/export:

When a newly appearing zero bucket arrives mid-stream on Scrape N (N > 1), getResetAdjusted treats the series reference as uninitialized (!hasReset) and marks dist.skip = true. The entire distribution sample is skipped on Scrape N, while _count and _sum advance their baseline tracking.
When worker jitter causes _count to decrease relative to the prior scrape (v < lastValue), getResetAdjusted resets resetTimestamp = t - 1. When _count subsequently recovers on Scrape N+1 while _sum lags, Monarch rejects the misaligned reset and start timestamps.
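The two behaviors above can be sketched in Go as follows. This is a simplified model written from the description in this PR, not a copy of google/export: the entry type and the resetAdjusted method are hypothetical names, and timestamps are plain integers.

```go
package main

import "fmt"

// entry holds per-series reset tracking state (hypothetical, simplified).
type entry struct {
	hasReset       bool
	resetTimestamp int64
	resetValue     float64
	lastValue      float64
}

// resetAdjusted returns the start timestamp and the reset-adjusted value for
// a cumulative sample. ok is false when the sample must be skipped.
func (e *entry) resetAdjusted(t int64, v float64) (start int64, adjusted float64, ok bool) {
	if !e.hasReset {
		// First observation of this series: it only establishes the baseline.
		e.hasReset = true
		e.resetTimestamp = t
		e.resetValue = v
		e.lastValue = v
		return t, v, false
	}
	if v < e.lastValue {
		// The value decreased: treat it as a counter reset and start a new
		// interval just before the current sample.
		e.resetValue = 0
		e.resetTimestamp = t - 1
	}
	e.lastValue = v
	return e.resetTimestamp, v - e.resetValue, true
}

func main() {
	count := &entry{}
	fmt.Println(count.resetAdjusted(1000, 10)) // baseline only, skipped
	fmt.Println(count.resetAdjusted(1030, 21)) // emitted relative to the baseline

	// A bucket series that first appears on the second scrape is skipped,
	// which in turn causes the whole distribution to be skipped.
	newBucket := &entry{}
	fmt.Println(newBucket.resetAdjusted(1030, 1))

	// A decrease moves the reset timestamp to t-1.
	fmt.Println(count.resetAdjusted(1060, 20))
}
```

Under this model a late-appearing bucket always costs one skipped distribution, and a transient decrease in _count gives that series a reset timestamp (t - 1) that the other series of the same histogram do not share.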

Solution

Patched transform.go and series_cache.go:

Authoritative Reset Coordination: seriesCache tracks established cumulative histogram reset timestamps from _count series in histogramResets.
Dynamic Bucket Normalization (getResetAdjustedBucket): When a bucket boundary (_bucket) arrives without prior tracking (!hasReset), if an authoritative reset timestamp was already established on an earlier scrape (rt < t), the bucket inherits the established reset timestamp and initializes baseline resetValue = 0.
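A minimal Go sketch of this coordination is shown below. It is an illustration under stated assumptions, not the patched code: the cache type, the adjustCount and adjustBucket helpers, and the use of a uint64 histogram hash are all hypothetical stand-ins for seriesCache, histogramResets, and getResetAdjustedBucket.

```go
package main

import "fmt"

type entry struct {
	hasReset       bool
	resetTimestamp int64
	resetValue     float64
	lastValue      float64
}

// cache is a hypothetical stand-in for the exporter's series cache.
type cache struct {
	entries         map[uint64]*entry // series ref -> reset state
	histogramResets map[uint64]int64  // histogram hash -> reset timestamp from _count
}

func newCache() *cache {
	return &cache{entries: map[uint64]*entry{}, histogramResets: map[uint64]int64{}}
}

func (c *cache) get(ref uint64) *entry {
	e, ok := c.entries[ref]
	if !ok {
		e = &entry{}
		c.entries[ref] = e
	}
	return e
}

// adjust is the plain reset adjustment: the first sample is skipped.
func (c *cache) adjust(ref uint64, t int64, v float64) (int64, float64, bool) {
	e := c.get(ref)
	if !e.hasReset {
		e.hasReset, e.resetTimestamp, e.resetValue, e.lastValue = true, t, v, v
		return t, v, false
	}
	if v < e.lastValue {
		e.resetValue, e.resetTimestamp = 0, t-1
	}
	e.lastValue = v
	return e.resetTimestamp, v - e.resetValue, true
}

// adjustCount adjusts a _count sample and records its reset timestamp as the
// authoritative one for the whole histogram.
func (c *cache) adjustCount(ref, hist uint64, t int64, v float64) (int64, float64, bool) {
	start, adj, ok := c.adjust(ref, t, v)
	c.histogramResets[hist] = c.get(ref).resetTimestamp
	return start, adj, ok
}

// adjustBucket lets a previously unseen bucket inherit the histogram's reset
// timestamp (if one was established before t) with a baseline of 0.
func (c *cache) adjustBucket(ref, hist uint64, t int64, v float64) (int64, float64, bool) {
	e := c.get(ref)
	if rt, known := c.histogramResets[hist]; !e.hasReset && known && rt < t {
		e.hasReset, e.resetTimestamp, e.resetValue, e.lastValue = true, rt, 0, v
		return rt, v, true
	}
	return c.adjust(ref, t, v)
}

func main() {
	c := newCache()
	c.adjustBucket(1, 42, 1000, 10) // scrape 1: skipped, no reset known yet
	c.adjustCount(2, 42, 1000, 10)  // scrape 1: skipped, records reset at 1000
	fmt.Println(c.adjustBucket(3, 42, 1030, 1)) // scrape 2: new bucket inherits 1000
}
```

The rt < t condition keeps first-scrape behavior unchanged: a bucket seen on the same scrape that established the reset is still treated as a baseline and skipped.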

Verifiable Investigation & Reproduction Artifacts

For reviewers interested in verifying the full Kong simulation, memory traces, and reproduction test harness, the complete experiment is documented and archived on branch kong-experiment-artifacts.

Created by Gemini

gemini-code-assist

Code Review

This pull request introduces support for handling dynamically appearing histogram buckets (such as zero-count buckets omitted by Kong) by tracking authoritative reset timestamps in a new histogramResets map and utilizing a specialized getResetAdjustedBucket method. A comprehensive test suite has been added to verify these changes. Feedback on the changes highlights a potential memory leak, as the newly introduced histogramResets map is never cleared or garbage-collected, and suggests updating the clear and garbageCollect methods to clean up inactive hashes.
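One way the suggested cleanup could look is sketched below in Go. The resetCache type, its fields, and the cutoff-based eviction rule are assumptions for illustration only; the real clear and garbageCollect methods in series_cache.go may be structured differently.

```go
package main

import "fmt"

// seriesEntry is a hypothetical cache entry that remembers which histogram
// it belongs to and when it was last refreshed.
type seriesEntry struct {
	histHash    uint64
	lastRefresh int64
}

// resetCache is a hypothetical stand-in for the exporter's series cache.
type resetCache struct {
	entries         map[uint64]*seriesEntry
	histogramResets map[uint64]int64
}

// garbageCollect evicts entries not refreshed since cutoff, then drops any
// histogram reset timestamp that no surviving entry refers to.
func (c *resetCache) garbageCollect(cutoff int64) {
	live := make(map[uint64]struct{})
	for ref, e := range c.entries {
		if e.lastRefresh < cutoff {
			delete(c.entries, ref) // deleting during range is safe in Go
			continue
		}
		live[e.histHash] = struct{}{}
	}
	for h := range c.histogramResets {
		if _, ok := live[h]; !ok {
			delete(c.histogramResets, h)
		}
	}
}

// clear drops all state, including the reset timestamps.
func (c *resetCache) clear() {
	c.entries = map[uint64]*seriesEntry{}
	c.histogramResets = map[uint64]int64{}
}

func main() {
	c := &resetCache{
		entries: map[uint64]*seriesEntry{
			1: {histHash: 42, lastRefresh: 100}, // stale
			2: {histHash: 43, lastRefresh: 900}, // active
		},
		histogramResets: map[uint64]int64{42: 50, 43: 60},
	}
	c.garbageCollect(500)
	fmt.Println(len(c.entries), len(c.histogramResets))
}
```

Tying the lifetime of each reset timestamp to the entries that reference it keeps the map bounded by the number of live histograms.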

bwplotka · 2026-06-24T23:15:16Z

+					{Ref: 3, T: 1000000, V: 10},
+					{Ref: 4, T: 1000000, V: 500},
+				},
+				{ // Scrape 2 (T=1030s): Mid-scrape request increments _count=21 while buckets reflect 20


I checked manually, and this does not trigger start time (ST) ingestion problems in Monarch.

bwplotka · 2026-06-24T23:24:08Z

+					{Ref: 4, T: 1060000, V: 1100},
+				},
+			},
+			wantSkipped: []bool{true, false, false},


It is true that we skip the sample in this case, but that is probably for the better; it does no harm. The proposed fix might work though (it adds the sample and ensures the right adjustment), if we want the added complexity.

Still not reproducing out-of-order ST (manually checked).

dashpole · 2026-06-25T03:26:17Z

(me actually writing)

I tried for a while longer to reproduce the actual issue. It seems like this CAN be the cause of gaps in histogram data, but that it can't be the source of start time ordering errors. Those seem like they could only be possible when scraping the same metrics from different ports or something like that.

Coordinate cumulative histogram reset timestamps across bucket series

b28fb05

gemini-code-assist Bot reviewed Jun 24, 2026

Comment thread google/export/series_cache.go

Clean up inactive cumulative histogram reset timestamps on cache evic…

f24a0e7

…tion

bwplotka reviewed Jun 24, 2026

Allow setting MetadataFunc on Storage for cumulative histogram testing

6ed49b1

dashpole closed this Jun 25, 2026

dashpole wants to merge 3 commits into GoogleCloudPlatform:release-2.53.5-gmp from dashpole:fix-kong-histogram-timestamps
