Kong issue repro attempt #319
Code Review
This pull request optimizes Docker build caching in Dockerfile.google, adds a helper script bw.docker.sh for building and extracting binaries, and introduces end-to-end ingestion tests under google/internal/ingeste2etest. Key feedback includes suppressing stderr in the Makefile when reading the service account key, using a trap in bw.docker.sh to clean up temporary containers on failure, avoiding resetting scrapeCount to prevent non-monotonic metric values, and guarding interactive test steps so they do not block CI/CD pipelines.
| mu.Lock() | ||
| scrapeCount = 0 | ||
| mu.Unlock() |
Resetting scrapeCount to 0 after starting Prometheus can cause non-monotonic metric values. Since Prometheus starts scraping immediately upon startup, it may have already performed one or more scrapes before this reset. Resetting the count to 0 will cause subsequent scrapes to return lower values, which Prometheus will interpret as a counter reset or a metric decrease, potentially leading to unexpected test behavior or query results. Consider letting the count increase monotonically or using a separate endpoint for readiness checks.
| mu.Lock() | |
| scrapeCount = 0 | |
| mu.Unlock() | |
| // Do not reset scrapeCount to avoid non-monotonic metric values if Prometheus scraped during startup. |
| testutil.Ok(t, e2einteractive.OpenInBrowser("http://"+prom.Endpoint("http"))) | ||
| testutil.Ok(t, e2einteractive.RunUntilEndpointHit()) |
Using e2einteractive.OpenInBrowser and e2einteractive.RunUntilEndpointHit will cause the test to block indefinitely or fail in non-interactive environments (such as CI/CD pipelines). It is recommended to guard these interactive steps with an environment variable or a command-line flag so they only run during local manual debugging.
| if os.Getenv("INTERACTIVE") == "true" { |
|     testutil.Ok(t, e2einteractive.OpenInBrowser("http://"+prom.Endpoint("http"))) |
|     testutil.Ok(t, e2einteractive.RunUntilEndpointHit()) |
| } |
| //fmt.Fprintf(&buf, "kong_kong_latency_ms_sum{case=\"healthy\"} %d\n", scrapeNum*310) | ||
| //fmt.Fprintf(&buf, "kong_kong_latency_ms_count{case=\"healthy\"} %d\n", scrapeNum*10) | ||
| // b. add_bucket is slowly increasing. On 3rd scrape new bucket arrives |
For some reason, those two histograms cause an issue, as in the logs below:
20:03:15 gmp-prom-ingest: {"caller":"export.go:624","component":"gcm_exporter","err":"invalid bucket with negative count {__name__=\"kong_kong_latency_ms_count\", case=\"add_bucket\", instance=\"proxy-kong\", job=\"test\"}: count=30.000000, sum=930.000000, dev=4056.000000, index=1, bucketVal=5, bucketPrevVal=6","level":"debug","msg":"building sample failed","ts":"2026-06-25T19:03:15.914Z"}
20:03:21 gmp-prom-ingest: {"caller":"export.go:1037","component":"gcm_exporter","err":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[2,6] (example metric.type=\"prometheus.googleapis.com/kong_kong_latency_ms/histogram\", metric.labels={\"case\": \"counter-missing\", \"collector\": \"gmp-prom-ingest\"}, resource.type=\"prometheus_target\", resource.labels={\"cluster\": \"pe-github-action\", \"location\": \"europe-west3-a\", \"instance\": \"proxy-kong\", \"job\": \"test\", \"project_id\": \"gpe-test-1\", \"namespace\": \"\"}): write for resource failed: Points must be written in order. One or more of the points specified had an older start time than the most recent point.\nerror details: name = Unknown desc = total_point_count:7 success_point_count:6 errors:{status:{code:3} point_count:1}","level":"error","msg":"send batch","size":7,"ts":"2026-06-25T19:03:21.047Z"}
20:03:25 gmp-prom-ingest: {"caller":"export.go:1037","component":"gcm_exporter","err":"rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[2,6] (example metric.type=\"prometheus.googleapis.com/kong_kong_latency_ms/histogram\", metric.labels={\"case\": \"add_bucket\", \"collector\": \"gmp-prom-ingest\"}, resource.type=\"prometheus_target\", resource.labels={\"namespace\": \"\", \"project_id\": \"gpe-test-1\", \"cluster\": \"pe-github-action\", \"instance\": \"proxy-kong\", \"job\": \"test\", \"location\": \"europe-west3-a\"}): write for resource failed: Points must be written in order. One or more of the points specified had an older start time than the most recent point.\nerror details: name = Unknown desc = total_point_count:7 success_point_count:6 errors:{status:{code:3} point_count:1}","level":"error","msg":"send batch","size":7,"ts":"2026-06-25T19:03:26.015Z"}
^C20:03:26 Killing gmp-prom-ingest
Notably, when I added more logging, the out-of-order start time (OOO ST) error no longer reproduces, so there is likely some race here.
Signed-off-by: bwplotka <bwplotka@gmail.com>
Add gmp-test-sa-key.json with monitoring write access - I used gpe-test-1.