From 626883211610a4271c572e846ef826fded4cb547 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com> Date: Thu, 4 Jun 2026 15:10:10 +0100 Subject: [PATCH 1/8] =?UTF-8?q?docs:=20add=20focused=20Monitoring=20&=20Lo?= =?UTF-8?q?gging=20=E2=86=94=20Operations=20BB=20design=20(data-access#202?= =?UTF-8?q?)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a narrow, ~15h-scoped first-delivery design for #202: clarify the eoAPI ↔ Operations BB monitoring/logging contract and make eoAPI's existing metrics part of the Operations BB by reusing what already exists (ingress metrics already scraped by kube-prometheus-stack, logs already shipped by Alloy, the chart's existing dashboard ConfigMap mechanism) — no new images, app changes, or backends. Includes the integration contract, a 15h task breakdown, an explicit out-of-scope/roadmap pointer, and costed upstream-improvement proposals for eoapi-k8s and the eoAPI apps. Registered under Design in the nav. Leaves the broader observability roadmap doc untouched. Co-Authored-By: Claude Opus 4.8 --- docs/design/observability-operations-bb.md | 128 +++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 129 insertions(+) create mode 100644 docs/design/observability-operations-bb.md diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md new file mode 100644 index 0000000..d571469 --- /dev/null +++ b/docs/design/observability-operations-bb.md @@ -0,0 +1,128 @@ +# Monitoring & Logging integration with the Operations BB + +> **Status: Draft for discussion.** Focused first-delivery design for +> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202). 
A broader, +> longer-term observability roadmap (full OpenTelemetry: traces, per-collection analytics) is +> tracked separately in [observability.md](./observability.md); this document is intentionally +> narrower and scoped to a single ~15h increment. + +## Goal & acceptance criterion + +data-access#202 asks for one thing: + +> *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging +> capabilities of eoAPI."* + +and notes that *"eoAPI already collects metrics and offers observability tools such as Grafana. +The goal of this ticket is to support that these metrics become part of the Operations Building +Block."* + +So this delivery is fundamentally **clarification + reuse of what already exists**, not a new +instrumentation stack. Concretely it produces: + +1. A written **contract** describing what eoAPI exposes and how the Operations BB consumes it. +2. A thin **demonstration** that makes that contract real: eoAPI service health visible in the + Operations BB's existing Grafana, driven by metrics that already exist, plus a few alert rules + — shipped through the eoapi-k8s chart and enabled on EOEPCA by values only. + +**Primary target: eoapi-k8s** (the chart we maintain), then a values-only adaptation for the +EOEPCA deployment. 
+ +## What already exists (and is therefore reused, not rebuilt) + +| Layer | Already present | Source | +|---|---|---| +| **Request metrics** | Per-service / per-route request counts (by status) and request-duration histograms at the ingress | ingress controller metrics, already scraped by the cluster Prometheus (the chart's autoscaling uses them) | +| **Metrics backend** | kube-prometheus-stack (Prometheus + Grafana + Alertmanager), 30d / 50Gi retention; `ServiceMonitor`/`PodMonitor` is the scrape path (`scrapeConfig` disabled) | `eoepca-plus` `argocd/operations/monitoring/` | +| **Logs** | Alloy DaemonSet auto-collects **every pod's stdout → Loki**; Grafana has a Loki datasource (`uid: loki`) | `eoepca-plus` `argocd/operations/monitoring/alloy/` | +| **Dashboards** | eoapi-k8s ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) | `charts/eoapi/templates/monitoring/observability.yaml` | +| **Grafana access** | SSO/OIDC with RBAC roles | `eoepca-plus` kube-prometheus-stack values | + +The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods + +request rate). It lacks operator-facing **error-rate** and **latency-percentile per service** +views, and there are no eoAPI-specific alert rules. That gap is what this delivery fills. + +## The contract — eoAPI ↔ Operations BB (the acceptance artifact) + +**What eoAPI exposes** +- **Metrics:** Prometheus-format metrics. Today these are request rate / errors / latency at the + **ingress layer** (per service and route-prefix), already scraped by the cluster Prometheus. + (App-level metrics — DB pool, internal latency — are a documented later increment; see + *Upstream improvements*.) +- **Logs:** structured logs to **stdout**, collected by the platform's Alloy → Loki pipeline. No + log push from eoAPI; the platform owns collection. 
+- **Standards:** Prometheus exposition for metrics; OpenTelemetry is the chosen standard for the + later traces increment (backend-neutral via OTLP). + +**What the Operations BB provides / owns** +- Scrapes eoAPI metrics via the existing kube-prometheus-stack (`ServiceMonitor`/`PodMonitor`). +- Stores and visualizes metrics (Prometheus + Grafana) and logs (Loki, fed by Alloy). +- Owns **retention, access control (SSO/RBAC), and alert routing** (Alertmanager). + +**Shared / handed over by this delivery** +- An "eoAPI operations" **Grafana dashboard** (shipped as a ConfigMap by the chart, imported by + the Operations BB's Grafana sidecar). +- A small set of **PrometheusRules** (availability / error-rate / latency), opt-in via chart + values, wired into the existing Alertmanager. + +**Boundaries (explicit)** +- eoAPI does **not** run its own Prometheus/Grafana/Loki in the cluster — it integrates with the + Operations BB's. The chart's bundled monitoring components stay disabled on EOEPCA. +- Metric label cardinality is bounded by design (no per-collection / per-tile labels in this + delivery) to respect the cluster's 30d/50Gi budget. 
+ +## First delivery — scope and effort (~15h) + +| # | Task | Effort | +|---|---|---| +| 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h | +| 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h | +| 3 | Build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h | +| 4 | Add **opt-in PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h | +| 5 | **EOEPCA enablement = values only** — PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`) enabling the dashboard + rules and confirming the scrape | 2h | +| 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h | +| | **Total** | **15h** | + +No new container images, no application code changes, and no new backend services — that is what +keeps this within 15h. + +## Out of scope (tracked in the roadmap, not this delivery) +- OpenTelemetry auto-instrumentation (DB spans, internal latency). +- Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed** + on EOEPCA today. +- **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a + cardinality budget agreed with the Operations BB. + +See [observability.md](./observability.md) for the full roadmap and the rationale. + +## Upstream improvements (do-it-well, if time allows) + +Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home for these is upstream — not +a per-deployment overlay. These are **separate allocations**, not part of the 15h. 
Estimates are +engineering hours (tests/docs/PR/release included); upstream PRs also carry review/CI/release +latency beyond these hours. + +| # | Upstream item | Effort | +|---|---|---| +| U1 | **eoapi-k8s** — opt-in `observability` extension: `ServiceMonitor` template + values toggle, and the ops panels folded into the bundled dashboard | 6–9h | +| U2 | **eoapi-k8s** — shared `OTEL_*` env passthrough scaffolding (a `telemetry` block) | 3–5h | +| U3 | **eoAPI apps** (titiler-pgstac / stac-fastapi-pgstac / tipg) — opt-in native Prometheus `/metrics` endpoint, off by default (app-level latency/error + DB-pool metrics) | 4–6h per app (~12–18h all three) | +| U4 | **eoAPI apps** — opt-in OpenTelemetry baked into images (env-driven, off by default); riskiest (dependency-conflict + multi-worker validation); only worthwhile once traces are wanted and Tempo exists | 5–7h per app (~15–20h all three) | +| U5 | **eoAPI apps** — bounded, opt-in per-collection metric (#193) with cardinality guards + dashboard panel | 8–14h | + +**Suggested packaging:** +- *Phase A (~10–15h)* — U1 + U3 (one app): make app-level metrics first-class in the chart and + one app. A natural second ~15h slot after this delivery. +- *Phase B (~8–12h)* — U3 across the remaining two apps. +- *Phase C (~20–25h)* — U2 + U4 (traces), gated on Tempo being deployed on EOEPCA. +- *Phase D (~8–14h)* — U5 (per-collection), gated on a cardinality budget agreed with Ops BB. + +## Verification / acceptance +- **AC met:** the contract above is reviewed and accepted by the Operations BB — the single #202 + checkbox. +- **Demonstration:** on `develop`, eoAPI service health (request rate / error rate / latency + percentiles per service) is visible in the **existing** Grafana via the shipped dashboard; the + alert rules load; logs are queryable in the **existing** Loki — with no new images and no app + changes. 
+- **Metric-source check (task 2)** confirms nginx vs APISIX and that duration histograms are + scraped before the dashboard queries are finalized. diff --git a/mkdocs.yml b/mkdocs.yml index 3b48f60..5829e5d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,6 +10,7 @@ nav: - Design: - design/overview.md - design/ogc-api-maps.md + - design/observability-operations-bb.md - API: - api/endpoint-specification.md - api/health-checks.md From 1d1567ca2f259f4d4ce9ac7c287dac67bf57c7c3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com> Date: Thu, 4 Jun 2026 15:15:40 +0100 Subject: [PATCH 2/8] docs: remove references to the unpublished broader observability doc Avoid confusion: the broader roadmap doc was never published, so drop the cross-links and keep this design self-contained. Co-Authored-By: Claude Opus 4.8 --- docs/design/observability-operations-bb.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md index d571469..b6bdcd0 100644 --- a/docs/design/observability-operations-bb.md +++ b/docs/design/observability-operations-bb.md @@ -1,10 +1,9 @@ # Monitoring & Logging integration with the Operations BB > **Status: Draft for discussion.** Focused first-delivery design for -> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202). A broader, -> longer-term observability roadmap (full OpenTelemetry: traces, per-collection analytics) is -> tracked separately in [observability.md](./observability.md); this document is intentionally -> narrower and scoped to a single ~15h increment. +> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally +> scoped to a single ~15h increment. Broader observability work (full OpenTelemetry: traces, +> per-collection analytics) is captured as future roadmap in the *Out of scope* section below. 
## Goal & acceptance criterion @@ -86,14 +85,14 @@ views, and there are no eoAPI-specific alert rules. That gap is what this delive No new container images, no application code changes, and no new backend services — that is what keeps this within 15h. -## Out of scope (tracked in the roadmap, not this delivery) +## Out of scope (future roadmap, not this delivery) - OpenTelemetry auto-instrumentation (DB spans, internal latency). - Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed** on EOEPCA today. - **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a cardinality budget agreed with the Operations BB. -See [observability.md](./observability.md) for the full roadmap and the rationale. +These are sketched in *Upstream improvements* below as costed, optional follow-ups. ## Upstream improvements (do-it-well, if time allows) From d67126090411ca7e2776808dd77e5d92ffe46f54 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com> Date: Thu, 4 Jun 2026 15:21:23 +0100 Subject: [PATCH 3/8] docs: refocus observability design on eoAPI; EOEPCA = config-only Center the deliverable on eoAPI (the eoapi-k8s chart) as a generic, opt-in capability for any Prometheus-Operator + Grafana platform, with EOEPCA reached via a few Helm-values changes rather than bespoke code. - Add a "Provided by" column splitting what eoAPI ships vs what the platform provides (the heart of the contract). - Add an explicit "EOEPCA integration = a few config changes" section. - Scope the ServiceMonitor to the later app-/metrics increment (U1/U3): the 15h delivery reads ingress metrics already scraped, so it needs only dashboard panels + alert rules in the chart. 
Co-Authored-By: Claude Opus 4.8 --- docs/design/observability-operations-bb.md | 112 +++++++++++++-------- 1 file changed, 69 insertions(+), 43 deletions(-) diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md index b6bdcd0..5a6831f 100644 --- a/docs/design/observability-operations-bb.md +++ b/docs/design/observability-operations-bb.md @@ -1,4 +1,4 @@ -# Monitoring & Logging integration with the Operations BB +# eoAPI Monitoring & Logging (Operations BB integration) > **Status: Draft for discussion.** Focused first-delivery design for > [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally @@ -19,56 +19,82 @@ Block."* So this delivery is fundamentally **clarification + reuse of what already exists**, not a new instrumentation stack. Concretely it produces: -1. A written **contract** describing what eoAPI exposes and how the Operations BB consumes it. -2. A thin **demonstration** that makes that contract real: eoAPI service health visible in the - Operations BB's existing Grafana, driven by metrics that already exist, plus a few alert rules - — shipped through the eoapi-k8s chart and enabled on EOEPCA by values only. +1. A written **contract** describing what eoAPI exposes and what a consuming platform provides. +2. A thin **demonstration** that makes that contract real: eoAPI service health visible in a + Grafana that already exists, driven by metrics that already exist, plus a few alert rules. -**Primary target: eoapi-k8s** (the chart we maintain), then a values-only adaptation for the -EOEPCA deployment. +**The work is delivered in eoAPI itself — the `eoapi-k8s` chart we maintain — as a generic, +opt-in capability** that works with any Prometheus-Operator + Grafana platform. **EOEPCA is +then reached with a few configuration (Helm values) changes**, not bespoke code. 
Doing it in the +chart (rather than as an EOEPCA-only overlay) keeps it reusable and low-maintenance for every +eoAPI deployment, EOEPCA included. ## What already exists (and is therefore reused, not rebuilt) -| Layer | Already present | Source | +The split below is the heart of the contract: a few things ship with **eoAPI** (the `eoapi-k8s` +chart, generic), and the rest is provided by the **consuming platform** (here, EOEPCA's +Operations BB — but any Prometheus-Operator + Grafana stack works). + +| Capability | Provided by | Already present | |---|---|---| -| **Request metrics** | Per-service / per-route request counts (by status) and request-duration histograms at the ingress | ingress controller metrics, already scraped by the cluster Prometheus (the chart's autoscaling uses them) | -| **Metrics backend** | kube-prometheus-stack (Prometheus + Grafana + Alertmanager), 30d / 50Gi retention; `ServiceMonitor`/`PodMonitor` is the scrape path (`scrapeConfig` disabled) | `eoepca-plus` `argocd/operations/monitoring/` | -| **Logs** | Alloy DaemonSet auto-collects **every pod's stdout → Loki**; Grafana has a Loki datasource (`uid: loki`) | `eoepca-plus` `argocd/operations/monitoring/alloy/` | -| **Dashboards** | eoapi-k8s ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) | `charts/eoapi/templates/monitoring/observability.yaml` | -| **Grafana access** | SSO/OIDC with RBAC roles | `eoepca-plus` kube-prometheus-stack values | +| **Grafana dashboard + delivery mechanism** | **eoAPI** (`eoapi-k8s`) | Chart ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) — `charts/eoapi/templates/monitoring/observability.yaml` | +| **Request metrics consumption** | **eoAPI** (`eoapi-k8s`) | Chart already consumes per-service request-rate metrics for autoscaling (HPA), so the query shapes are known | +| **Request metrics source** | **Platform** | Per-service / 
per-route request counts (by status) + request-duration histograms from the ingress controller | +| **Metrics backend** | **Platform** | Prometheus-Operator (e.g. kube-prometheus-stack: Prometheus + Grafana + Alertmanager); `ServiceMonitor`/`PodMonitor` is the scrape path. *(EOEPCA: 30d / 50Gi, `scrapeConfig` disabled.)* | +| **Logs** | **Platform** | Log shipper auto-collects pod stdout → Loki. *(EOEPCA: Alloy DaemonSet; Grafana Loki datasource `uid: loki`.)* | +| **Grafana access** | **Platform** | Auth/RBAC. *(EOEPCA: SSO/OIDC with roles.)* | The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods + request rate). It lacks operator-facing **error-rate** and **latency-percentile per service** -views, and there are no eoAPI-specific alert rules. That gap is what this delivery fills. - -## The contract — eoAPI ↔ Operations BB (the acceptance artifact) - -**What eoAPI exposes** -- **Metrics:** Prometheus-format metrics. Today these are request rate / errors / latency at the - **ingress layer** (per service and route-prefix), already scraped by the cluster Prometheus. - (App-level metrics — DB pool, internal latency — are a documented later increment; see - *Upstream improvements*.) -- **Logs:** structured logs to **stdout**, collected by the platform's Alloy → Loki pipeline. No - log push from eoAPI; the platform owns collection. -- **Standards:** Prometheus exposition for metrics; OpenTelemetry is the chosen standard for the - later traces increment (backend-neutral via OTLP). - -**What the Operations BB provides / owns** -- Scrapes eoAPI metrics via the existing kube-prometheus-stack (`ServiceMonitor`/`PodMonitor`). -- Stores and visualizes metrics (Prometheus + Grafana) and logs (Loki, fed by Alloy). -- Owns **retention, access control (SSO/RBAC), and alert routing** (Alertmanager). 
- -**Shared / handed over by this delivery** -- An "eoAPI operations" **Grafana dashboard** (shipped as a ConfigMap by the chart, imported by - the Operations BB's Grafana sidecar). -- A small set of **PrometheusRules** (availability / error-rate / latency), opt-in via chart - values, wired into the existing Alertmanager. +views, and eoAPI ships no alert rules. **That gap — dashboard panels + alert rules, in the +chart — is what this delivery fills**, generically. (No new scrape target is needed: it reads +ingress metrics the platform already collects.) + +## The contract — what eoAPI provides vs. what the platform provides + +This is the acceptance artifact. It is written for **eoAPI in general**; the Operations BB is the +first consumer. + +**What eoAPI provides (in the `eoapi-k8s` chart)** +- A **Grafana dashboard** ("eoAPI operations": request rate, error rate by status, latency + p50/p95/p99 per service) shipped as a ConfigMap, with a **templated datasource** so it imports + into any Grafana. +- A small set of opt-in **PrometheusRules** (availability / error-rate / latency), so any + Prometheus-Operator platform alerts on eoAPI without bespoke wiring. *(An opt-in + `ServiceMonitor` is added later, once eoAPI exposes its own `/metrics` — see U1/U3 in + Upstream improvements; the first delivery needs none, because it reads ingress metrics the + platform already scrapes.)* +- **Metrics** in Prometheus format (today: request rate / errors / latency at the ingress, per + service and route-prefix; app-level metrics are a later increment — see *Upstream improvements*). +- **Logs** to **stdout** in a structured form, for the platform's log shipper to collect. eoAPI + does not push logs. +- **Standards:** Prometheus exposition for metrics; OpenTelemetry (backend-neutral OTLP) is the + chosen standard for the later traces increment. + +**What the consuming platform provides (e.g. 
EOEPCA's Operations BB)** +- A **Prometheus-Operator** stack (scrapes targets via `ServiceMonitor`/`PodMonitor`; today it + already scrapes the ingress controller's metrics), plus **Grafana** (with a dashboard sidecar) + and **Alertmanager**. +- A **log pipeline** (shipper → Loki) that collects pod stdout. +- **Retention, access control, and alert routing.** **Boundaries (explicit)** -- eoAPI does **not** run its own Prometheus/Grafana/Loki in the cluster — it integrates with the - Operations BB's. The chart's bundled monitoring components stay disabled on EOEPCA. +- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's. + The chart's bundled monitoring components stay disabled when a platform stack is present. - Metric label cardinality is bounded by design (no per-collection / per-tile labels in this - delivery) to respect the cluster's 30d/50Gi budget. + delivery) to respect platform budgets (EOEPCA: 30d / 50Gi). + +### EOEPCA integration = a few config changes +Because the capability lives in the chart, bringing it to EOEPCA is **values only** in +`eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`): +1. **Enable** the chart's observability extension (dashboard ConfigMap + PrometheusRules). +2. **Match the dashboard sidecar** label/namespace so EOEPCA's Grafana imports it + (`eoapi_dashboard` → confirm against the cluster's sidecar config). +3. **Confirm the scrape target** (ingress metric source: nginx vs APISIX) and leave the chart's + bundled Prometheus **disabled** (the cluster stack is used instead). + +No image rebuilds, no application changes — ArgoCD syncs the values and the dashboard/rules +appear in the existing Grafana/Prometheus. ## First delivery — scope and effort (~15h) @@ -76,9 +102,9 @@ views, and there are no eoAPI-specific alert rules. 
That gap is what this delive |---|---|---| | 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h | | 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h | -| 3 | Build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h | -| 4 | Add **opt-in PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h | -| 5 | **EOEPCA enablement = values only** — PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`) enabling the dashboard + rules and confirming the scrape | 2h | +| 3 | **In `eoapi-k8s` (generic):** build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h | +| 4 | **In `eoapi-k8s` (generic):** add opt-in **PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h | +| 5 | **EOEPCA integration = a few config changes** — values PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`): enable the extension, match the dashboard sidecar, confirm the scrape | 2h | | 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h | | | **Total** | **15h** | From 127f9e67024a5bd86379f13a09efedb42be9b052 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com> Date: Thu, 4 Jun 2026 15:41:29 +0100 Subject: [PATCH 4/8] docs: align observability design with the Operations BB spec MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reviewed against the 
Operations BB docs (STAC Scenario, ServiceMonitors, Alerting & SLOs) and corrected the design: - Ingress is APISIX, not nginx — the chart's nginx_ingress_controller_* metrics/autoscaling don't exist on EOEPCA; document APISIX (apisix_http_status / apisix_http_latency) + postgres-exporter + Keep. - The BB already has gateway/DB/log/SLO machinery; the named gap is native app metrics (/metrics returns 404 on eoapi-stac and stac-auth-proxy; no data-access ServiceMonitor). - Promote native, bounded /metrics + ServiceMonitor + dashboard/alert panels (the STAC slice) into the first delivery; this matches the BB's stated desired end state and feeds the existing 99%/500ms burn-rate SLO. - Keep bounded-cardinality, Alloy logs, platform ownership; defer traces (no Tempo) and per-collection. Co-Authored-By: Claude Opus 4.8 --- docs/design/observability-operations-bb.md | 240 ++++++++++----------- 1 file changed, 116 insertions(+), 124 deletions(-) diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md index 5a6831f..3c54e4e 100644 --- a/docs/design/observability-operations-bb.md +++ b/docs/design/observability-operations-bb.md @@ -1,9 +1,10 @@ # eoAPI Monitoring & Logging (Operations BB integration) > **Status: Draft for discussion.** Focused first-delivery design for -> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally -> scoped to a single ~15h increment. Broader observability work (full OpenTelemetry: traces, -> per-collection analytics) is captured as future roadmap in the *Out of scope* section below. +> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), aligned with the +> [Operations BB documentation](https://eoepca.readthedocs.io/projects/operations/en/latest/) +> (in particular its *STAC Scenario*, *ServiceMonitors*, and *Alerting and SLOs* pages). +> Scoped to a single increment (~15h). Tracing and per-collection analytics are future roadmap. 
## Goal & acceptance criterion @@ -12,142 +13,133 @@ data-access#202 asks for one thing: > *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging > capabilities of eoAPI."* -and notes that *"eoAPI already collects metrics and offers observability tools such as Grafana. -The goal of this ticket is to support that these metrics become part of the Operations Building -Block."* +The Operations BB's **STAC Scenario** already states the concrete expectation, so the +"clarification" is really *"adopt what the Operations BB already specifies and close the gap it +names."* This delivery therefore produces: -So this delivery is fundamentally **clarification + reuse of what already exists**, not a new -instrumentation stack. Concretely it produces: +1. A written **contract** — what eoAPI exposes vs. what the Operations BB provides. +2. The **first concrete step** of the BB's stated desired end state: native, bounded application + metrics from `eoapi-stac` and `eoapi-stac-auth-proxy`, a `ServiceMonitor`, and dashboard/alert + panels built on them. -1. A written **contract** describing what eoAPI exposes and what a consuming platform provides. -2. A thin **demonstration** that makes that contract real: eoAPI service health visible in a - Grafana that already exists, driven by metrics that already exist, plus a few alert rules. +**Delivered in eoAPI itself** — the apps (`stac-fastapi-pgstac`, `stac-auth-proxy`) and the +`eoapi-k8s` chart we maintain — as a **generic, opt-in capability**. **EOEPCA is then reached +with a few Helm-values changes.** Doing it upstream (not as an EOEPCA-only overlay) keeps it +reusable and low-maintenance for every eoAPI deployment. -**The work is delivered in eoAPI itself — the `eoapi-k8s` chart we maintain — as a generic, -opt-in capability** that works with any Prometheus-Operator + Grafana platform. **EOEPCA is -then reached with a few configuration (Helm values) changes**, not bespoke code. 
Doing it in the -chart (rather than as an EOEPCA-only overlay) keeps it reusable and low-maintenance for every -eoAPI deployment, EOEPCA included. +## How the Operations BB monitors eoAPI today -## What already exists (and is therefore reused, not rebuilt) +From the Operations BB docs, the request path is +`client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack +is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There +is no tracing backend (no Tempo).** What already exists: -The split below is the heart of the contract: a few things ship with **eoAPI** (the `eoapi-k8s` -chart, generic), and the rest is provided by the **consuming platform** (here, EOEPCA's -Operations BB — but any Prometheus-Operator + Grafana stack works). - -| Capability | Provided by | Already present | +| Signal | Source today | Notes | |---|---|---| -| **Grafana dashboard + delivery mechanism** | **eoAPI** (`eoapi-k8s`) | Chart ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) — `charts/eoapi/templates/monitoring/observability.yaml` | -| **Request metrics consumption** | **eoAPI** (`eoapi-k8s`) | Chart already consumes per-service request-rate metrics for autoscaling (HPA), so the query shapes are known | -| **Request metrics source** | **Platform** | Per-service / per-route request counts (by status) + request-duration histograms from the ingress controller | -| **Metrics backend** | **Platform** | Prometheus-Operator (e.g. kube-prometheus-stack: Prometheus + Grafana + Alertmanager); `ServiceMonitor`/`PodMonitor` is the scrape path. *(EOEPCA: 30d / 50Gi, `scrapeConfig` disabled.)* | -| **Logs** | **Platform** | Log shipper auto-collects pod stdout → Loki. *(EOEPCA: Alloy DaemonSet; Grafana Loki datasource `uid: loki`.)* | -| **Grafana access** | **Platform** | Auth/RBAC. 
*(EOEPCA: SSO/OIDC with roles.)* | - -The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods + -request rate). It lacks operator-facing **error-rate** and **latency-percentile per service** -views, and eoAPI ships no alert rules. **That gap — dashboard panels + alert rules, in the -chart — is what this delivery fills**, generically. (No new scrape target is needed: it reads -ingress metrics the platform already collects.) +| **Gateway metrics** | **APISIX** `/apisix/prometheus/metrics`, scraped by a `ServiceMonitor` in `ingress-apisix` | `apisix_http_status` (by code/route/method) and `apisix_http_latency` (`type=request\|upstream\|apisix`) — bounded labels; three latency views | +| **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` | +| **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads | +| **Synthetic checks** | external black-box probe | validates the public STAC endpoint | +| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis | + +> **Important correction:** EOEPCA's ingress is **APISIX, not nginx-ingress.** The `eoapi-k8s` +> chart's autoscaling/dashboard use `nginx_ingress_controller_*` metrics, which **do not exist** +> on EOEPCA. Any eoAPI-shipped dashboard must therefore use APISIX (or native app) metrics. + +### The gap the Operations BB names +- `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`. +- There are **no `ServiceMonitor` objects in the `data-access` namespace**. 
+- Alerts/SLOs rely on **indirect gateway signals**; the BB explicitly wants *"semantic,
+  low-cardinality metrics from the application itself, then scrape them with a `ServiceMonitor`"*
+  to measure request rate, errors and latency **directly from the service**.
+
+So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
+metrics**. That is what this delivery contributes.
 ## The contract — what eoAPI provides vs. what the platform provides
-This is the acceptance artifact. It is written for **eoAPI in general**; the Operations BB is the
-first consumer.
-
-**What eoAPI provides (in the `eoapi-k8s` chart)**
-- A **Grafana dashboard** ("eoAPI operations": request rate, error rate by status, latency
-  p50/p95/p99 per service) shipped as a ConfigMap, with a **templated datasource** so it imports
-  into any Grafana.
-- A small set of opt-in **PrometheusRules** (availability / error-rate / latency), so any
-  Prometheus-Operator platform alerts on eoAPI without bespoke wiring. *(An opt-in
-  `ServiceMonitor` is added later, once eoAPI exposes its own `/metrics` — see U1/U3 in
-  Upstream improvements; the first delivery needs none, because it reads ingress metrics the
-  platform already scrapes.)*
-- **Metrics** in Prometheus format (today: request rate / errors / latency at the ingress, per
-  service and route-prefix; app-level metrics are a later increment — see *Upstream improvements*).
-- **Logs** to **stdout** in a structured form, for the platform's log shipper to collect. eoAPI
-  does not push logs.
-- **Standards:** Prometheus exposition for metrics; OpenTelemetry (backend-neutral OTLP) is the
-  chosen standard for the later traces increment.
-
-**What the consuming platform provides (e.g. EOEPCA's Operations BB)**
-- A **Prometheus-Operator** stack (scrapes targets via `ServiceMonitor`/`PodMonitor`; today it
-  already scrapes the ingress controller's metrics), plus **Grafana** (with a dashboard sidecar)
-  and **Alertmanager**.
-- A **log pipeline** (shipper → Loki) that collects pod stdout.
-- **Retention, access control, and alert routing.**
-
-**Boundaries (explicit)**
-- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's.
-  The chart's bundled monitoring components stay disabled when a platform stack is present.
-- Metric label cardinality is bounded by design (no per-collection / per-tile labels in this
-  delivery) to respect platform budgets (EOEPCA: 30d / 50Gi).
+This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
+consumer.
+
+| Capability | Provided by | Detail |
+|---|---|---|
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template, method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only. |
+| **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
+| **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
+| **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
+| **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
+| **DB metrics** | **Platform** | postgres-exporter. |
+| **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
+| **Log pipeline** | **Platform** | Alloy → Loki. |
+
+**Boundaries**
+- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's; the
+  chart's bundled monitoring components stay disabled where a platform stack exists.
+- **Cardinality is bounded by design** — route template / method / status class only; **no full
+  URLs, no per-collection / per-tile labels** (the BB calls out high-cardinality churn explicitly;
+  EOEPCA has been bitten before). Per-collection analytics are out of scope (see roadmap).
 ### EOEPCA integration = a few config changes
-Because the capability lives in the chart, bringing it to EOEPCA is **values only** in
+Because the capability lives in the apps + chart, bringing it to EOEPCA is **values only** in
 `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
-1. **Enable** the chart's observability extension (dashboard ConfigMap + PrometheusRules).
-2. **Match the dashboard sidecar** label/namespace so EOEPCA's Grafana imports it
-   (`eoapi_dashboard` → confirm against the cluster's sidecar config).
-3. **Confirm the scrape target** (ingress metric source: nginx vs APISIX) and leave the chart's
-   bundled Prometheus **disabled** (the cluster stack is used instead).
+1. **Enable** native metrics (env) on `eoapi-stac` / `eoapi-stac-auth-proxy`.
+2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
+   sidecar label/namespace (`eoapi_dashboard`).
+3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
+   ServiceMonitor). Over time, point the existing STAC burn-rate records at the **native
+   request metrics** instead of the indirect gateway signal.
-No image rebuilds, no application changes — ArgoCD syncs the values and the dashboard/rules
-appear in the existing Grafana/Prometheus.
+No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.
-## First delivery — scope and effort (~15h)
+## First delivery — scope and effort (~15h, the STAC slice)
-| # | Task | Effort |
-|---|---|---|
-| 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h |
-| 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h |
-| 3 | **In `eoapi-k8s` (generic):** build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h |
-| 4 | **In `eoapi-k8s` (generic):** add opt-in **PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h |
-| 5 | **EOEPCA integration = a few config changes** — values PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`): enable the extension, match the dashboard sidecar, confirm the scrape | 2h |
-| 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h |
-| | **Total** | **15h** |
-
-No new container images, no application code changes, and no new backend services — that is what
-keeps this within 15h.
-
-## Out of scope (future roadmap, not this delivery)
-- OpenTelemetry auto-instrumentation (DB spans, internal latency).
-- Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed**
-  on EOEPCA today.
-- **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a
-  cardinality budget agreed with the Operations BB.
-
-These are sketched in *Upstream improvements* below as costed, optional follow-ups.
-
-## Upstream improvements (do-it-well, if time allows)
-
-Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home for these is upstream — not
-a per-deployment overlay. These are **separate allocations**, not part of the 15h. Estimates are
-engineering hours (tests/docs/PR/release included); upstream PRs also carry review/CI/release
-latency beyond these hours.
-
-| # | Upstream item | Effort |
+The STAC Scenario is the BB's worked example, so the first increment targets it end-to-end.
+
+| # | Task | Where | Effort |
+|---|---|---|---|
+| 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template, method, status class) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
+| 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
+| | **Total** | | **15h** |
+
+**Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
+release cadence**, not the engineering hours. If a release is slow, tasks 1, 4 and the
+clarification (the #202 AC) can land first while the `/metrics` PRs go through.
+
+## EOEPCA enablement (values only)
+A follow-on values PR to `eoepca-plus` enables metrics + ServiceMonitor + dashboard and confirms
+the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/chart work.)
+
+## Out of scope (future roadmap)
+- **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
+  vector (tipg), multidim.
+- **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
+- **Per-collection analytics** (eoAPI#193) — high cardinality; needs a bounded design and a
+  budget agreed with the Operations BB.
+
+## Upstream improvements (do-it-well, beyond the first delivery)
+
+Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home is upstream. Estimates are
+engineering hours (tests/docs included); upstream PRs also carry review/CI/release latency.
+
+| # | Item | Effort |
 |---|---|---|
-| U1 | **eoapi-k8s** — opt-in `observability` extension: `ServiceMonitor` template + values toggle, and the ops panels folded into the bundled dashboard | 6–9h |
-| U2 | **eoapi-k8s** — shared `OTEL_*` env passthrough scaffolding (a `telemetry` block) | 3–5h |
-| U3 | **eoAPI apps** (titiler-pgstac / stac-fastapi-pgstac / tipg) — opt-in native Prometheus `/metrics` endpoint, off by default (app-level latency/error + DB-pool metrics) | 4–6h per app (~12–18h all three) |
-| U4 | **eoAPI apps** — opt-in OpenTelemetry baked into images (env-driven, off by default); riskiest (dependency-conflict + multi-worker validation); only worthwhile once traces are wanted and Tempo exists | 5–7h per app (~15–20h all three) |
-| U5 | **eoAPI apps** — bounded, opt-in per-collection metric (#193) with cardinality guards + dashboard panel | 8–14h |
-
-**Suggested packaging:**
-- *Phase A (~10–15h)* — U1 + U3 (one app): make app-level metrics first-class in the chart and
-  one app. A natural second ~15h slot after this delivery.
-- *Phase B (~12–18h)* — U3 across the remaining apps.
-- *Phase C (~20–25h)* — U2 + U4 (traces), gated on Tempo being deployed on EOEPCA.
-- *Phase D (~8–14h)* — U5 (per-collection), gated on a cardinality budget agreed with Ops BB.
+| U1 | Extend native `/metrics` to **raster / vector / multidim** (same pattern as STAC) | 4–6h per app |
+| U2 | First-class `eoapi-k8s` **`observability`/`telemetry` values block** (toggle metrics + ServiceMonitor + dashboard + standard `OTEL_*` env passthrough) | 6–9h |
+| U3 | Opt-in **OpenTelemetry / traces** baked into images (env-driven, off by default); riskiest (dependency-conflict + uvicorn multi-worker `WEB_CONCURRENCY` validation); only once traces are wanted **and** Tempo exists | 5–7h per app |
+| U4 | Bounded, opt-in **per-collection metric** (#193) with cardinality guards + dashboard panel | 8–14h |
 ## Verification / acceptance
-- **AC met:** the contract above is reviewed and accepted by the Operations BB — the single #202
-  checkbox.
-- **Demonstration:** on `develop`, eoAPI service health (request rate / error rate / latency
-  percentiles per service) is visible in the **existing** Grafana via the shipped dashboard; the
-  alert rules load; logs are queryable in the **existing** Loki — with no new images and no app
-  changes.
-- **Metric-source check (task 2)** confirms nginx vs APISIX and that duration histograms are
-  scraped before the dashboard queries are finalized.
+- **AC met:** the contract is reviewed and accepted by the Operations BB — the #202 checkbox.
+- **Gap closed:** `GET /metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy` returns **200**
+  (not 404), exposing bounded, route-templated series.
+- **Scraped:** the `ServiceMonitor` is picked up by the cluster Prometheus
+  (`kubectl get servicemonitor -n data-access`; series present).
+- **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
+  existing Grafana via the shipped ConfigMap.
+- **SLO:** native request records can feed the existing **99% < 500ms** burn-rate rules
+  (`stac_get_latency_500ms_*`), replacing the indirect gateway signal.
+- **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
+- **Cardinality check:** no full-URL / per-collection labels on any new series.
From f4618ffc9cfa69c7f8af6e8d1576cd64c1fd4229 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:47:56 +0100
Subject: [PATCH 5/8] docs: fold in Operations BB demo insights
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From the Ops BB (Versioneer/ESA) STAC SLO demo:
- The missing signal is operation-level latency (search vs item-listing vs collection-listing); URL-as-label is the cardinality trap they backed off — native route-template metrics solve it safely.
- POST /search is the latency-critical path the SLO alert targets.
- Agreed minimum bar: enable the framework's out-of-the-box metrics.
- Alerting is SLO/user-impact-centric, not CPU/memory.
- Keep enrichment correlates gateway/app/DB latency; native app-layer latency makes the "is it the app?" branch accurate.
- Per-collection (e.g. only VHR slow) is a real but deeper, deferred need.

Co-Authored-By: Claude Opus 4.8
---
 docs/design/observability-operations-bb.md | 29 +++++++++++++++++++---
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index 3c54e4e..d73613d 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -56,6 +56,26 @@ is no tracing backend (no Tempo).** What already exists:
 So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
 metrics**. That is what this delivery contributes.

+### Design implications from the Operations BB demo
+The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow. Points that shape this design:
+
+- **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
+  request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
+  sees GET/POST, and putting the **URL in a label is a cardinality problem they tried and backed
+  off**. Native metrics keyed by **route template (= operation)** close this safely.
+- **POST `/search` is the latency-critical path** ("most of the stuff that takes longer is a
+  POST"); the SLO burn-rate alert is on STAC POST latency. Instrument and surface it prominently.
+- **Minimum bar = "what the framework gives for free."** ESA and the BB owner agreed the baseline
+  is enabling the out-of-the-box FastAPI route/method/latency-bucket metrics (a library + little
+  code) — exactly the opt-in `/metrics` proposed here.
+- **Alerting is SLO / user-impact-centric, not CPU/memory.** Per-workload CPU/mem already comes
+  for free and is *not* alerted on; eoAPI's contribution is the RED signals behind the SLO and the
+  dashboard the alert links to.
+- **Keep enrichment correlates gateway vs app vs DB latency** to localize the bottleneck (e.g.
+  "DB mean 11 ms → not the DB; app burn 2.2 → app fine"). Native **app-layer** latency makes the
+  "is it the app?" branch accurate instead of inferred from APISIX upstream latency. eoAPI's
+  metrics/rules should carry **stable labels** so Keep can pull and correlate them.
+
 ## The contract — what eoAPI provides vs. what the platform provides

 This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
@@ -63,7 +83,7 @@ consumer.

 | Capability | Provided by | Detail |
 |---|---|---|
-| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template, method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only. |
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
 | **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
 | **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
 | **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
@@ -98,7 +118,7 @@ The STAC Scenario is the BB's worked example, so the first increment targets it
 | # | Task | Where | Effort |
 |---|---|---|---|
 | 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
-| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template, method, status class) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
 | 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
 | 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
 | 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
@@ -116,8 +136,9 @@ the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/char
 - **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
   vector (tipg), multidim.
 - **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
-- **Per-collection analytics** (eoAPI#193) — high cardinality; needs a bounded design and a
-  budget agreed with the Operations BB.
+- **Per-collection analytics** (eoAPI#193) — a real root-cause need (the Ops BB demo's example:
+  "only the VHR data is slow"), but high cardinality and needs service-specific knowledge + a
+  bounded design and budget agreed with the Operations BB.

 ## Upstream improvements (do-it-well, beyond the first delivery)

From 578ca2a0917b2e261b4914ff36ab80f29e45f586 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:52:01 +0100
Subject: [PATCH 6/8] docs: phrase observability design as a standalone first version

Drop revision-style framing so the document reads as an original proposal (no implied earlier published version).
Co-Authored-By: Claude Opus 4.8
---
 docs/design/observability-operations-bb.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index d73613d..fea539f 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -42,9 +42,9 @@ is no tracing backend (no Tempo).** What already exists:
 | **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
 | **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis |

-> **Important correction:** EOEPCA's ingress is **APISIX, not nginx-ingress.** The `eoapi-k8s`
-> chart's autoscaling/dashboard use `nginx_ingress_controller_*` metrics, which **do not exist**
-> on EOEPCA. Any eoAPI-shipped dashboard must therefore use APISIX (or native app) metrics.
+> **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
+> chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
+> present** on EOEPCA, so any eoAPI-shipped dashboard uses APISIX (or native app) metrics.

 ### The gap the Operations BB names
 - `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.
From baeb4e62e501170d528cfbb6a9d7db940d2ad3cf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:55:17 +0100
Subject: [PATCH 7/8] docs: link the EOEPCA+ demo (8 May 2026) in the observability design

Co-Authored-By: Claude Opus 4.8
---
 docs/design/observability-operations-bb.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index fea539f..39e9ec0 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -57,7 +57,9 @@ So the gateway/DB/log/SLO machinery already exists — **the missing piece is na
 metrics**. That is what this delivery contributes.

 ### Design implications from the Operations BB demo
-The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow. Points that shape this design:
+The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link).
+Points that shape this design:

 - **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
   request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
From 96f0a28e4704519db18adb8484f3b760debc5a0f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?= <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 16:05:25 +0100
Subject: [PATCH 8/8] docs: tighten observability design after review

- Clarify GET vs POST burn-rate records; POST /search is the critical path.
- Note /metrics is cluster-internal only (port 8080), not via public APISIX ingress (matters for the auth proxy).
- Make burn-rate rebasing the Ops BB's decision; eoAPI only exposes signals.
- Soften the logs row (collected as-is; structured logging is later).
- Add deployed-name <-> upstream-package mapping and cross-repo dependency order; flag the demo Drive link as access-restricted.

Co-Authored-By: Claude Opus 4.8
---
 docs/design/observability-operations-bb.md | 35 ++++++++++++++--------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index 39e9ec0..6e3c6d9 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -32,7 +32,12 @@ reusable and low-maintenance for every eoAPI deployment.
 From the Operations BB docs, the request path is
 `client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
 is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
-is no tracing backend (no Tempo).** What already exists:
+is no tracing backend (no Tempo).**
+
+> Deployed-service ↔ upstream-package names: **`eoapi-stac`** = `stac-fastapi-pgstac`,
+> **`eoapi-stac-auth-proxy`** = `stac-auth-proxy` (both FastAPI/Starlette apps we maintain).
+
+What already exists:

 | Signal | Source today | Notes |
 |---|---|---|
@@ -40,7 +45,7 @@ is no tracing backend (no Tempo).** What already exists:
 | **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
 | **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
 | **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
-| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis |
+| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records exist for **both GET and POST** (`stac_get_latency_500ms_burn_rate_1h` / `stac_post_latency_500ms_burn_rate_1h`, critical at `> 14.4`). **POST `/search` is the latency-critical path**; gateway/DB records serve as diagnosis |

 > **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
 > chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
@@ -57,9 +62,9 @@ So the gateway/DB/log/SLO machinery already exists — **the missing piece is na
 metrics**. That is what this delivery contributes.

 ### Design implications from the Operations BB demo
-The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow at the
-[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link).
-Points that shape this design:
+The Operations BB team demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link)
+(recording is access-restricted Google Drive). Points that shape this design:

 - **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
   request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
@@ -85,10 +90,10 @@ consumer.

 | Capability | Provided by | Detail |
 |---|---|---|
-| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`, on the **existing app port (8080), cluster-internal only** (scraped by the ServiceMonitor; **not** routed through the public APISIX ingress). Request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
 | **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
 | **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
-| **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
+| **Logs** | **eoAPI** | Logs to **stdout**, collected **as-is** by the platform's shipper (Alloy→Loki); eoAPI does not push logs. Structured/JSON logging is a later increment, not part of this delivery. |
 | **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
 | **DB metrics** | **Platform** | postgres-exporter. |
 | **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
@@ -108,8 +113,8 @@ Because the capability lives in the apps + chart, bringing it to EOEPCA is **val
 2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
    sidecar label/namespace (`eoapi_dashboard`).
 3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
-   ServiceMonitor). Over time, point the existing STAC burn-rate records at the **native
-   request metrics** instead of the indirect gateway signal.
+   ServiceMonitor). The Ops BB may then **choose** to rebase its STAC burn-rate records onto the
+   native request metrics — that is the BB's decision; eoAPI only exposes the signal.

 No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.

@@ -123,9 +128,14 @@ The STAC Scenario is the BB's worked example, so the first increment targets it
 | 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
 | 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
 | 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
-| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how the native series can back the existing 99%/500ms SLO — the BB owns whether to rebase its rules) | docs | 1h |
 | | **Total** | | **15h** |

+**Dependency order (critical path):** app `/metrics` released (tasks 2–3, repos
+`stac-fastapi-pgstac` + `stac-auth-proxy`) → chart `ServiceMonitor` + dashboard (task 4,
+`eoapi-k8s`) → EOEPCA values enablement (`eoepca-plus`). The clarification (task 1) and the
+chart work (task 4) can proceed in parallel.
+
 **Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
 release cadence**, not the engineering hours. If a release is slow, tasks 1, 4 and the
 clarification (the #202 AC) can land first while the `/metrics` PRs go through.
@@ -162,7 +172,8 @@ engineering hours (tests/docs included); upstream PRs also carry review/CI/relea
   (`kubectl get servicemonitor -n data-access`; series present).
 - **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
   existing Grafana via the shipped ConfigMap.
-- **SLO:** native request records can feed the existing **99% < 500ms** burn-rate rules
-  (`stac_get_latency_500ms_*`), replacing the indirect gateway signal.
+- **SLO:** the native request series are suitable to back the existing **99% < 500ms** SLO
+  (`stac_get_latency_500ms_*` / `stac_post_latency_500ms_*`); whether the Ops BB rebases its
+  burn-rate rules onto them, instead of the indirect gateway signal, is the BB's decision.
 - **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
 - **Cardinality check:** no full-URL / per-collection labels on any new series.
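
Review note on the bounded-label contract and the burn-rate threshold: the sketch below is one possible way to picture them, not the instrumentation proposed in task 2 (which reuses `prometheus-fastapi-instrumentator` or equivalent). It uses only the standard library. The route patterns, the `other` fallback, the function names and the example collection ids are all assumptions made for illustration; a real app would take route templates from its router. The burn-rate helper follows the usual definition (observed bad fraction divided by the error budget), so with a 99% target a value of 14.4 means 14.4% of requests exceeded 500 ms in the window.

```python
import re
from collections import Counter

# Hypothetical route templates for illustration only; a real app would read
# them from its router instead of hard-coding regular expressions.
_ROUTES = [
    ("/search", re.compile(r"^/search/?$")),
    ("/collections", re.compile(r"^/collections/?$")),
    ("/collections/{collection_id}", re.compile(r"^/collections/[^/]+/?$")),
    ("/collections/{collection_id}/items",
     re.compile(r"^/collections/[^/]+/items/?$")),
    ("/collections/{collection_id}/items/{item_id}",
     re.compile(r"^/collections/[^/]+/items/[^/]+/?$")),
]


def route_template(path: str) -> str:
    """Map a concrete URL path to a route template (a bounded label value)."""
    for template, pattern in _ROUTES:
        if pattern.match(path):
            return template
    return "other"  # unknown paths collapse into a single series


def status_class(status_code: int) -> str:
    """Collapse a status code into its class, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


def bounded_labels(method: str, path: str, status_code: int) -> tuple:
    """Label set limited to route template, method and status class."""
    return (route_template(path), method.upper(), status_class(status_code))


def burn_rate(slow: int, total: int, slo_target: float = 0.99) -> float:
    """Observed bad fraction divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0
    return (slow / total) / (1.0 - slo_target)


# Two different (made-up) collections land in the same series, so the number
# of series does not grow with the number of collections.
requests_total = Counter()
for method, path, status in [
    ("POST", "/search", 200),
    ("GET", "/collections/example-a/items", 200),
    ("GET", "/collections/example-b/items", 200),
    ("GET", "/not/a/stac/route", 404),
]:
    requests_total[bounded_labels(method, path, status)] += 1
```

In this sketch the four requests produce three series, and POST `/search` keeps its own series so its latency can be surfaced separately. The same mapping idea applies to histogram buckets, which are omitted here to keep the example short.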