From 626883211610a4271c572e846ef826fded4cb547 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:10:10 +0100
Subject: [PATCH 1/8] =?UTF-8?q?docs:=20add=20focused=20Monitoring=20&=20Lo?=
 =?UTF-8?q?gging=20=E2=86=94=20Operations=20BB=20design=20(data-access#202?=
 =?UTF-8?q?)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a narrow, ~15h-scoped first-delivery design for #202: clarify the
eoAPI ↔ Operations BB monitoring/logging contract and make eoAPI's
existing metrics part of the Operations BB by reusing what already
exists (ingress metrics already scraped by kube-prometheus-stack, logs
already shipped by Alloy, the chart's existing dashboard ConfigMap
mechanism) — no new images, app changes, or backends.

Includes the integration contract, a 15h task breakdown, an explicit
out-of-scope/roadmap pointer, and costed upstream-improvement proposals
for eoapi-k8s and the eoAPI apps. Registered under Design in the nav.
Leaves the broader observability roadmap doc untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
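The dashboard panels described in this patch plot latency percentiles
(p50/p95/p99) from request-duration histograms. For readers unfamiliar with
how a percentile is recovered from cumulative histogram buckets, here is a
minimal Python sketch, assuming linear interpolation within a bucket (the same
general idea as PromQL's histogram_quantile, not a copy of its implementation).
The function name estimate_quantile and the bucket bounds and counts are
hypothetical, chosen only for illustration; real bucket boundaries depend on
how the ingress controller is configured.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs, sorted by
    upper bound, ending with an overflow bucket whose bound is math.inf.
    Observations are assumed to be spread evenly inside each bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Overflow bucket (or empty first bucket): no upper edge to
                # interpolate towards, so report the lower edge instead.
                return prev_bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds) for one service.
example_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
p50 = estimate_quantile(0.50, example_buckets)
p90 = estimate_quantile(0.90, example_buckets)
```

Because the result is interpolated, its accuracy is limited by bucket width: a
p99 can only be as precise as the buckets that surround it.
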
 docs/design/observability-operations-bb.md | 128 +++++++++++++++++++++
 mkdocs.yml                                 |   1 +
 2 files changed, 129 insertions(+)
 create mode 100644 docs/design/observability-operations-bb.md

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
new file mode 100644
index 0000000..d571469
--- /dev/null
+++ b/docs/design/observability-operations-bb.md
@@ -0,0 +1,128 @@
+# Monitoring & Logging integration with the Operations BB
+
+> **Status: Draft for discussion.** Focused first-delivery design for
+> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202). A broader,
+> longer-term observability roadmap (full OpenTelemetry: traces, per-collection analytics) is
+> tracked separately in [observability.md](./observability.md); this document is intentionally
+> narrower and scoped to a single ~15h increment.
+
+## Goal & acceptance criterion
+
+data-access#202 asks for one thing:
+
+> *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging
+> capabilities of eoAPI."*
+
+and notes that *"eoAPI already collects metrics and offers observability tools such as Grafana.
+The goal of this ticket is to support that these metrics become part of the Operations Building
+Block."*
+
+So this delivery is fundamentally **clarification + reuse of what already exists**, not a new
+instrumentation stack. Concretely it produces:
+
+1. A written **contract** describing what eoAPI exposes and how the Operations BB consumes it.
+2. A thin **demonstration** that makes that contract real: eoAPI service health visible in the
+   Operations BB's existing Grafana, driven by metrics that already exist, plus a few alert rules
+   — shipped through the eoapi-k8s chart and enabled on EOEPCA by values only.
+
+**Primary target: eoapi-k8s** (the chart we maintain), then a values-only adaptation for the
+EOEPCA deployment.
+
+## What already exists (and is therefore reused, not rebuilt)
+
+| Layer | Already present | Source |
+|---|---|---|
+| **Request metrics** | Per-service / per-route request counts (by status) and request-duration histograms at the ingress | ingress controller metrics, already scraped by the cluster Prometheus (the chart's autoscaling uses them) |
+| **Metrics backend** | kube-prometheus-stack (Prometheus + Grafana + Alertmanager), 30d / 50Gi retention; `ServiceMonitor`/`PodMonitor` is the scrape path (`scrapeConfig` disabled) | `eoepca-plus` `argocd/operations/monitoring/` |
+| **Logs** | Alloy DaemonSet auto-collects **every pod's stdout → Loki**; Grafana has a Loki datasource (`uid: loki`) | `eoepca-plus` `argocd/operations/monitoring/alloy/` |
+| **Dashboards** | eoapi-k8s ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) | `charts/eoapi/templates/monitoring/observability.yaml` |
+| **Grafana access** | SSO/OIDC with RBAC roles | `eoepca-plus` kube-prometheus-stack values |
+
+The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods +
+request rate). It lacks operator-facing **error-rate** and **latency-percentile per service**
+views, and there are no eoAPI-specific alert rules. That gap is what this delivery fills.
+
+## The contract — eoAPI ↔ Operations BB (the acceptance artifact)
+
+**What eoAPI exposes**
+- **Metrics:** Prometheus-format metrics. Today these are request rate / errors / latency at the
+  **ingress layer** (per service and route-prefix), already scraped by the cluster Prometheus.
+  (App-level metrics — DB pool, internal latency — are a documented later increment; see
+  *Upstream improvements*.)
+- **Logs:** structured logs to **stdout**, collected by the platform's Alloy → Loki pipeline. No
+  log push from eoAPI; the platform owns collection.
+- **Standards:** Prometheus exposition for metrics; OpenTelemetry is the chosen standard for the
+  later traces increment (backend-neutral via OTLP).
+
+**What the Operations BB provides / owns**
+- Scrapes eoAPI metrics via the existing kube-prometheus-stack (`ServiceMonitor`/`PodMonitor`).
+- Stores and visualizes metrics (Prometheus + Grafana) and logs (Loki, fed by Alloy).
+- Owns **retention, access control (SSO/RBAC), and alert routing** (Alertmanager).
+
+**Shared / handed over by this delivery**
+- An "eoAPI operations" **Grafana dashboard** (shipped as a ConfigMap by the chart, imported by
+  the Operations BB's Grafana sidecar).
+- A small set of **PrometheusRules** (availability / error-rate / latency), opt-in via chart
+  values, wired into the existing Alertmanager.
+
+**Boundaries (explicit)**
+- eoAPI does **not** run its own Prometheus/Grafana/Loki in the cluster — it integrates with the
+  Operations BB's. The chart's bundled monitoring components stay disabled on EOEPCA.
+- Metric label cardinality is bounded by design (no per-collection / per-tile labels in this
+  delivery) to respect the cluster's 30d/50Gi budget.
+
+## First delivery — scope and effort (~15h)
+
+| # | Task | Effort |
+|---|---|---|
+| 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h |
+| 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h |
+| 3 | Build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h |
+| 4 | Add **opt-in PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h |
+| 5 | **EOEPCA enablement = values only** — PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`) enabling the dashboard + rules and confirming the scrape | 2h |
+| 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h |
+| | **Total** | **15h** |
+
+No new container images, no application code changes, and no new backend services — that is what
+keeps this within 15h.
+
+## Out of scope (tracked in the roadmap, not this delivery)
+- OpenTelemetry auto-instrumentation (DB spans, internal latency).
+- Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed**
+  on EOEPCA today.
+- **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a
+  cardinality budget agreed with the Operations BB.
+
+See [observability.md](./observability.md) for the full roadmap and the rationale.
+
+## Upstream improvements (do-it-well, if time allows)
+
+Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home for these is upstream — not
+a per-deployment overlay. These are **separate allocations**, not part of the 15h. Estimates are
+engineering hours (tests/docs/PR/release included); upstream PRs also carry review/CI/release
+latency beyond these hours.
+
+| # | Upstream item | Effort |
+|---|---|---|
+| U1 | **eoapi-k8s** — opt-in `observability` extension: `ServiceMonitor` template + values toggle, and the ops panels folded into the bundled dashboard | 6–9h |
+| U2 | **eoapi-k8s** — shared `OTEL_*` env passthrough scaffolding (a `telemetry` block) | 3–5h |
+| U3 | **eoAPI apps** (titiler-pgstac / stac-fastapi-pgstac / tipg) — opt-in native Prometheus `/metrics` endpoint, off by default (app-level latency/error + DB-pool metrics) | 4–6h per app (~12–18h all three) |
+| U4 | **eoAPI apps** — opt-in OpenTelemetry baked into images (env-driven, off by default); riskiest (dependency-conflict + multi-worker validation); only worthwhile once traces are wanted and Tempo exists | 5–7h per app (~15–21h all three) |
+| U5 | **eoAPI apps** — bounded, opt-in per-collection metric (#193) with cardinality guards + dashboard panel | 8–14h |
+
+**Suggested packaging:**
+- *Phase A (~10–15h)* — U1 + U3 (one app): make app-level metrics first-class in the chart and
+  one app. A natural second ~15h slot after this delivery.
+- *Phase B (~8–12h)* — U3 across the remaining apps.
+- *Phase C (~20–25h)* — U2 + U4 (traces), gated on Tempo being deployed on EOEPCA.
+- *Phase D (~8–14h)* — U5 (per-collection), gated on a cardinality budget agreed with Ops BB.
+
+## Verification / acceptance
+- **AC met:** the contract above is reviewed and accepted by the Operations BB — the single #202
+  checkbox.
+- **Demonstration:** on `develop`, eoAPI service health (request rate / error rate / latency
+  percentiles per service) is visible in the **existing** Grafana via the shipped dashboard; the
+  alert rules load; logs are queryable in the **existing** Loki — with no new images and no app
+  changes.
+- **Metric-source check (task 2)** confirms nginx vs APISIX and that duration histograms are
+  scraped before the dashboard queries are finalized.
diff --git a/mkdocs.yml b/mkdocs.yml
index 3b48f60..5829e5d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -10,6 +10,7 @@ nav:
     - Design:
       - design/overview.md
       - design/ogc-api-maps.md
+      - design/observability-operations-bb.md
     - API:
       - api/endpoint-specification.md
       - api/health-checks.md

From 1d1567ca2f259f4d4ce9ac7c287dac67bf57c7c3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:15:40 +0100
Subject: [PATCH 2/8] docs: remove references to the unpublished broader
 observability doc

Avoid confusion: the broader roadmap doc was never published, so drop the
cross-links and keep this design self-contained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index d571469..b6bdcd0 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -1,10 +1,9 @@
 # Monitoring & Logging integration with the Operations BB
 
 > **Status: Draft for discussion.** Focused first-delivery design for
-> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202). A broader,
-> longer-term observability roadmap (full OpenTelemetry: traces, per-collection analytics) is
-> tracked separately in [observability.md](./observability.md); this document is intentionally
-> narrower and scoped to a single ~15h increment.
+> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally
+> scoped to a single ~15h increment. Broader observability work (full OpenTelemetry: traces,
+> per-collection analytics) is captured as future roadmap in the *Out of scope* section below.
 
 ## Goal & acceptance criterion
 
@@ -86,14 +85,14 @@ views, and there are no eoAPI-specific alert rules. That gap is what this delive
 No new container images, no application code changes, and no new backend services — that is what
 keeps this within 15h.
 
-## Out of scope (tracked in the roadmap, not this delivery)
+## Out of scope (future roadmap, not this delivery)
 - OpenTelemetry auto-instrumentation (DB spans, internal latency).
 - Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed**
   on EOEPCA today.
 - **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a
   cardinality budget agreed with the Operations BB.
 
-See [observability.md](./observability.md) for the full roadmap and the rationale.
+These are sketched in *Upstream improvements* below as costed, optional follow-ups.
 
 ## Upstream improvements (do-it-well, if time allows)
 

From d67126090411ca7e2776808dd77e5d92ffe46f54 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:21:23 +0100
Subject: [PATCH 3/8] docs: refocus observability design on eoAPI; EOEPCA =
 config-only

Center the deliverable on eoAPI (the eoapi-k8s chart) as a generic,
opt-in capability for any Prometheus-Operator + Grafana platform, with
EOEPCA reached via a few Helm-values changes rather than bespoke code.

- Add a "Provided by" column splitting what eoAPI ships vs what the
  platform provides (the heart of the contract).
- Add an explicit "EOEPCA integration = a few config changes" section.
- Scope the ServiceMonitor to the later app-level /metrics increment (U1/U3):
  the 15h delivery reads ingress metrics already scraped, so it needs
  only dashboard panels + alert rules in the chart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 112 +++++++++++++--------
 1 file changed, 69 insertions(+), 43 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index b6bdcd0..5a6831f 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -1,4 +1,4 @@
-# Monitoring & Logging integration with the Operations BB
+# eoAPI Monitoring & Logging (Operations BB integration)
 
 > **Status: Draft for discussion.** Focused first-delivery design for
 > [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally
@@ -19,56 +19,82 @@ Block."*
 So this delivery is fundamentally **clarification + reuse of what already exists**, not a new
 instrumentation stack. Concretely it produces:
 
-1. A written **contract** describing what eoAPI exposes and how the Operations BB consumes it.
-2. A thin **demonstration** that makes that contract real: eoAPI service health visible in the
-   Operations BB's existing Grafana, driven by metrics that already exist, plus a few alert rules
-   — shipped through the eoapi-k8s chart and enabled on EOEPCA by values only.
+1. A written **contract** describing what eoAPI exposes and what a consuming platform provides.
+2. A thin **demonstration** that makes that contract real: eoAPI service health visible in a
+   Grafana that already exists, driven by metrics that already exist, plus a few alert rules.
 
-**Primary target: eoapi-k8s** (the chart we maintain), then a values-only adaptation for the
-EOEPCA deployment.
+**The work is delivered in eoAPI itself — the `eoapi-k8s` chart we maintain — as a generic,
+opt-in capability** that works with any Prometheus-Operator + Grafana platform. **EOEPCA is
+then reached with a few configuration (Helm values) changes**, not bespoke code. Doing it in the
+chart (rather than as an EOEPCA-only overlay) keeps it reusable and low-maintenance for every
+eoAPI deployment, EOEPCA included.
 
 ## What already exists (and is therefore reused, not rebuilt)
 
-| Layer | Already present | Source |
+The split below is the heart of the contract: a few things ship with **eoAPI** (the `eoapi-k8s`
+chart, generic), and the rest is provided by the **consuming platform** (here, EOEPCA's
+Operations BB — but any Prometheus-Operator + Grafana stack works).
+
+| Capability | Provided by | Already present |
 |---|---|---|
-| **Request metrics** | Per-service / per-route request counts (by status) and request-duration histograms at the ingress | ingress controller metrics, already scraped by the cluster Prometheus (the chart's autoscaling uses them) |
-| **Metrics backend** | kube-prometheus-stack (Prometheus + Grafana + Alertmanager), 30d / 50Gi retention; `ServiceMonitor`/`PodMonitor` is the scrape path (`scrapeConfig` disabled) | `eoepca-plus` `argocd/operations/monitoring/` |
-| **Logs** | Alloy DaemonSet auto-collects **every pod's stdout → Loki**; Grafana has a Loki datasource (`uid: loki`) | `eoepca-plus` `argocd/operations/monitoring/alloy/` |
-| **Dashboards** | eoapi-k8s ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) | `charts/eoapi/templates/monitoring/observability.yaml` |
-| **Grafana access** | SSO/OIDC with RBAC roles | `eoepca-plus` kube-prometheus-stack values |
+| **Grafana dashboard + delivery mechanism** | **eoAPI** (`eoapi-k8s`) | Chart ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) — `charts/eoapi/templates/monitoring/observability.yaml` |
+| **Request metrics consumption** | **eoAPI** (`eoapi-k8s`) | Chart already consumes per-service request-rate metrics for autoscaling (HPA), so the query shapes are known |
+| **Request metrics source** | **Platform** | Per-service / per-route request counts (by status) + request-duration histograms from the ingress controller |
+| **Metrics backend** | **Platform** | Prometheus-Operator (e.g. kube-prometheus-stack: Prometheus + Grafana + Alertmanager); `ServiceMonitor`/`PodMonitor` is the scrape path. *(EOEPCA: 30d / 50Gi, `scrapeConfig` disabled.)* |
+| **Logs** | **Platform** | Log shipper auto-collects pod stdout → Loki. *(EOEPCA: Alloy DaemonSet; Grafana Loki datasource `uid: loki`.)* |
+| **Grafana access** | **Platform** | Auth/RBAC. *(EOEPCA: SSO/OIDC with roles.)* |
 
 The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods +
 request rate). It lacks operator-facing **error-rate** and **latency-percentile per service**
-views, and there are no eoAPI-specific alert rules. That gap is what this delivery fills.
-
-## The contract — eoAPI ↔ Operations BB (the acceptance artifact)
-
-**What eoAPI exposes**
-- **Metrics:** Prometheus-format metrics. Today these are request rate / errors / latency at the
-  **ingress layer** (per service and route-prefix), already scraped by the cluster Prometheus.
-  (App-level metrics — DB pool, internal latency — are a documented later increment; see
-  *Upstream improvements*.)
-- **Logs:** structured logs to **stdout**, collected by the platform's Alloy → Loki pipeline. No
-  log push from eoAPI; the platform owns collection.
-- **Standards:** Prometheus exposition for metrics; OpenTelemetry is the chosen standard for the
-  later traces increment (backend-neutral via OTLP).
-
-**What the Operations BB provides / owns**
-- Scrapes eoAPI metrics via the existing kube-prometheus-stack (`ServiceMonitor`/`PodMonitor`).
-- Stores and visualizes metrics (Prometheus + Grafana) and logs (Loki, fed by Alloy).
-- Owns **retention, access control (SSO/RBAC), and alert routing** (Alertmanager).
-
-**Shared / handed over by this delivery**
-- An "eoAPI operations" **Grafana dashboard** (shipped as a ConfigMap by the chart, imported by
-  the Operations BB's Grafana sidecar).
-- A small set of **PrometheusRules** (availability / error-rate / latency), opt-in via chart
-  values, wired into the existing Alertmanager.
+views, and eoAPI ships no alert rules. **That gap — dashboard panels + alert rules, in the
+chart — is what this delivery fills**, generically. (No new scrape target is needed: it reads
+ingress metrics the platform already collects.)
+
+## The contract — what eoAPI provides vs. what the platform provides
+
+This is the acceptance artifact. It is written for **eoAPI in general**; the Operations BB is the
+first consumer.
+
+**What eoAPI provides (in the `eoapi-k8s` chart)**
+- A **Grafana dashboard** ("eoAPI operations": request rate, error rate by status, latency
+  p50/p95/p99 per service) shipped as a ConfigMap, with a **templated datasource** so it imports
+  into any Grafana.
+- A small set of opt-in **PrometheusRules** (availability / error-rate / latency), so any
+  Prometheus-Operator platform alerts on eoAPI without bespoke wiring. *(An opt-in
+  `ServiceMonitor` is added later, once eoAPI exposes its own `/metrics` — see U1/U3 in
+  Upstream improvements; the first delivery needs none, because it reads ingress metrics the
+  platform already scrapes.)*
+- **Metrics** in Prometheus format (today: request rate / errors / latency at the ingress, per
+  service and route-prefix; app-level metrics are a later increment — see *Upstream improvements*).
+- **Logs** to **stdout** in a structured form, for the platform's log shipper to collect. eoAPI
+  does not push logs.
+- **Standards:** Prometheus exposition for metrics; OpenTelemetry (backend-neutral OTLP) is the
+  chosen standard for the later traces increment.
+
+**What the consuming platform provides (e.g. EOEPCA's Operations BB)**
+- A **Prometheus-Operator** stack (scrapes targets via `ServiceMonitor`/`PodMonitor`; today it
+  already scrapes the ingress controller's metrics), plus **Grafana** (with a dashboard sidecar)
+  and **Alertmanager**.
+- A **log pipeline** (shipper → Loki) that collects pod stdout.
+- **Retention, access control, and alert routing.**
 
 **Boundaries (explicit)**
-- eoAPI does **not** run its own Prometheus/Grafana/Loki in the cluster — it integrates with the
-  Operations BB's. The chart's bundled monitoring components stay disabled on EOEPCA.
+- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's.
+  The chart's bundled monitoring components stay disabled when a platform stack is present.
 - Metric label cardinality is bounded by design (no per-collection / per-tile labels in this
-  delivery) to respect the cluster's 30d/50Gi budget.
+  delivery) to respect platform budgets (EOEPCA: 30d / 50Gi).
+
+### EOEPCA integration = a few config changes
+Because the capability lives in the chart, bringing it to EOEPCA is **values only** in
+`eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
+1. **Enable** the chart's observability extension (dashboard ConfigMap + PrometheusRules).
+2. **Match the dashboard sidecar** label/namespace so EOEPCA's Grafana imports it
+   (`eoapi_dashboard` → confirm against the cluster's sidecar config).
+3. **Confirm the scrape target** (ingress metric source: nginx vs APISIX) and leave the chart's
+   bundled Prometheus **disabled** (the cluster stack is used instead).
+
+No image rebuilds, no application changes — ArgoCD syncs the values and the dashboard/rules
+appear in the existing Grafana/Prometheus.
 
 ## First delivery — scope and effort (~15h)
 
@@ -76,9 +102,9 @@ views, and there are no eoAPI-specific alert rules. That gap is what this delive
 |---|---|---|
 | 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h |
 | 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h |
-| 3 | Build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h |
-| 4 | Add **opt-in PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h |
-| 5 | **EOEPCA enablement = values only** — PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`) enabling the dashboard + rules and confirming the scrape | 2h |
+| 3 | **In `eoapi-k8s` (generic):** build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h |
+| 4 | **In `eoapi-k8s` (generic):** add opt-in **PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h |
+| 5 | **EOEPCA integration = a few config changes** — values PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`): enable the extension, match the dashboard sidecar, confirm the scrape | 2h |
 | 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h |
 | | **Total** | **15h** |
 

From 127f9e67024a5bd86379f13a09efedb42be9b052 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:41:29 +0100
Subject: [PATCH 4/8] docs: align observability design with the Operations BB
 spec
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reviewed against the Operations BB docs (STAC Scenario, ServiceMonitors,
Alerting & SLOs) and corrected the design:

- Ingress is APISIX, not nginx — the chart's nginx_ingress_controller_*
  metrics/autoscaling don't exist on EOEPCA; document APISIX
  (apisix_http_status / apisix_http_latency) + postgres-exporter + Keep.
- The BB already has gateway/DB/log/SLO machinery; the named gap is
  native app metrics (/metrics returns 404 on eoapi-stac and
  stac-auth-proxy; no data-access ServiceMonitor).
- Promote native, bounded /metrics + ServiceMonitor + dashboard/alert
  panels (the STAC slice) into the first delivery; this matches the BB's
  stated desired end state and feeds the existing 99%/500ms burn-rate SLO.
- Keep bounded-cardinality, Alloy logs, platform ownership; defer traces
  (no Tempo) and per-collection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
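The burn-rate SLO mentioned above compares how fast the error budget is being
spent against the rate that would exhaust it exactly at the end of the SLO
period. A minimal Python sketch of that arithmetic follows, assuming the 99%
target (requests under 500ms) stated in this patch and, as an assumption not
taken from the source, a 30-day SLO period. The function names and request
counts are hypothetical; the actual recording-rule expressions used by the
Operations BB are not reproduced here.

```python
def burn_rate(total_requests, slow_requests, slo_target=0.99):
    """Return how many times faster than 'sustainable' the budget is spent.

    A value of 1.0 means the error budget would last exactly one SLO period.
    """
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # allowed fraction of slow requests
    observed_bad_fraction = slow_requests / total_requests
    return observed_bad_fraction / error_budget


def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the whole-period budget spent at `rate` for `window_hours`.

    The 30-day default period is an assumption made for this sketch.
    """
    return rate * window_hours / slo_period_hours


# Hypothetical 1h window: 1000 GET requests, 144 of them slower than 500ms.
rate_1h = burn_rate(total_requests=1000, slow_requests=144)
spent_1h = budget_consumed(rate_1h, window_hours=1)
is_critical = rate_1h > 14.4  # threshold quoted in the design doc
```

Under the assumed 30-day period, a burn rate of 14.4 sustained for one hour
spends about 2% of the budget (14.4 / 720), which is why a threshold of that
size is commonly paired with a 1h window.
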
 docs/design/observability-operations-bb.md | 240 ++++++++++-----------
 1 file changed, 116 insertions(+), 124 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index 5a6831f..3c54e4e 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -1,9 +1,10 @@
 # eoAPI Monitoring & Logging (Operations BB integration)
 
 > **Status: Draft for discussion.** Focused first-delivery design for
-> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), intentionally
-> scoped to a single ~15h increment. Broader observability work (full OpenTelemetry: traces,
-> per-collection analytics) is captured as future roadmap in the *Out of scope* section below.
+> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), aligned with the
+> [Operations BB documentation](https://eoepca.readthedocs.io/projects/operations/en/latest/)
+> (in particular its *STAC Scenario*, *ServiceMonitors*, and *Alerting and SLOs* pages).
+> Scoped to a single increment (~15h). Tracing and per-collection analytics are future roadmap.
 
 ## Goal & acceptance criterion
 
@@ -12,142 +13,133 @@ data-access#202 asks for one thing:
 > *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging
 > capabilities of eoAPI."*
 
-and notes that *"eoAPI already collects metrics and offers observability tools such as Grafana.
-The goal of this ticket is to support that these metrics become part of the Operations Building
-Block."*
+The Operations BB's **STAC Scenario** already states the concrete expectation, so the
+"clarification" is really *"adopt what the Operations BB already specifies and close the gap it
+names."* This delivery therefore produces:
 
-So this delivery is fundamentally **clarification + reuse of what already exists**, not a new
-instrumentation stack. Concretely it produces:
+1. A written **contract** — what eoAPI exposes vs. what the Operations BB provides.
+2. The **first concrete step** of the BB's stated desired end state: native, bounded application
+   metrics from `eoapi-stac` and `eoapi-stac-auth-proxy`, a `ServiceMonitor`, and dashboard/alert
+   panels built on them.
 
-1. A written **contract** describing what eoAPI exposes and what a consuming platform provides.
-2. A thin **demonstration** that makes that contract real: eoAPI service health visible in a
-   Grafana that already exists, driven by metrics that already exist, plus a few alert rules.
+**Delivered in eoAPI itself** — the apps (`stac-fastapi-pgstac`, `stac-auth-proxy`) and the
+`eoapi-k8s` chart we maintain — as a **generic, opt-in capability**. **EOEPCA is then reached
+with a few Helm-values changes.** Doing it upstream (not as an EOEPCA-only overlay) keeps it
+reusable and low-maintenance for every eoAPI deployment.
 
-**The work is delivered in eoAPI itself — the `eoapi-k8s` chart we maintain — as a generic,
-opt-in capability** that works with any Prometheus-Operator + Grafana platform. **EOEPCA is
-then reached with a few configuration (Helm values) changes**, not bespoke code. Doing it in the
-chart (rather than as an EOEPCA-only overlay) keeps it reusable and low-maintenance for every
-eoAPI deployment, EOEPCA included.
+## How the Operations BB monitors eoAPI today
 
-## What already exists (and is therefore reused, not rebuilt)
+From the Operations BB docs, the request path is
+`client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
+is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
+is no tracing backend (no Tempo).** What already exists:
 
-The split below is the heart of the contract: a few things ship with **eoAPI** (the `eoapi-k8s`
-chart, generic), and the rest is provided by the **consuming platform** (here, EOEPCA's
-Operations BB — but any Prometheus-Operator + Grafana stack works).
-
-| Capability | Provided by | Already present |
+| Signal | Source today | Notes |
 |---|---|---|
-| **Grafana dashboard + delivery mechanism** | **eoAPI** (`eoapi-k8s`) | Chart ships `eoAPI-Dashboard.json` and an `observability` values block + ConfigMap mechanism (label `eoapi_dashboard`) — `charts/eoapi/templates/monitoring/observability.yaml` |
-| **Request metrics consumption** | **eoAPI** (`eoapi-k8s`) | Chart already consumes per-service request-rate metrics for autoscaling (HPA), so the query shapes are known |
-| **Request metrics source** | **Platform** | Per-service / per-route request counts (by status) + request-duration histograms from the ingress controller |
-| **Metrics backend** | **Platform** | Prometheus-Operator (e.g. kube-prometheus-stack: Prometheus + Grafana + Alertmanager); `ServiceMonitor`/`PodMonitor` is the scrape path. *(EOEPCA: 30d / 50Gi, `scrapeConfig` disabled.)* |
-| **Logs** | **Platform** | Log shipper auto-collects pod stdout → Loki. *(EOEPCA: Alloy DaemonSet; Grafana Loki datasource `uid: loki`.)* |
-| **Grafana access** | **Platform** | Auth/RBAC. *(EOEPCA: SSO/OIDC with roles.)* |
-
-The gap is small: the bundled eoAPI dashboard is resource/autoscaling-oriented (CPU/mem/pods +
-request rate). It lacks operator-facing **error-rate** and **latency-percentile per service**
-views, and eoAPI ships no alert rules. **That gap — dashboard panels + alert rules, in the
-chart — is what this delivery fills**, generically. (No new scrape target is needed: it reads
-ingress metrics the platform already collects.)
+| **Gateway metrics** | **APISIX** `/apisix/prometheus/metrics`, scraped by a `ServiceMonitor` in `ingress-apisix` | `apisix_http_status` (by code/route/method) and `apisix_http_latency` (`type=request\|upstream\|apisix`) — bounded labels; three latency views |
+| **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
+| **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
+| **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
+| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis |
+
+> **Important correction:** EOEPCA's ingress is **APISIX, not nginx-ingress.** The `eoapi-k8s`
+> chart's autoscaling/dashboard use `nginx_ingress_controller_*` metrics, which **do not exist**
+> on EOEPCA. Any eoAPI-shipped dashboard must therefore use APISIX (or native app) metrics.
+
+### The gap the Operations BB names
+- `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.
+- There are **no `ServiceMonitor` objects in the `data-access` namespace**.
+- Alerts/SLOs rely on **indirect gateway signals**; the BB explicitly wants *"semantic,
+  low-cardinality metrics from the application itself, then scrape them with a `ServiceMonitor`"*
+  to measure request rate, errors and latency **directly from the service**.
+
+So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
+metrics**. That is what this delivery contributes.
 
 ## The contract — what eoAPI provides vs. what the platform provides
 
-This is the acceptance artifact. It is written for **eoAPI in general**; the Operations BB is the
-first consumer.
-
-**What eoAPI provides (in the `eoapi-k8s` chart)**
-- A **Grafana dashboard** ("eoAPI operations": request rate, error rate by status, latency
-  p50/p95/p99 per service) shipped as a ConfigMap, with a **templated datasource** so it imports
-  into any Grafana.
-- A small set of opt-in **PrometheusRules** (availability / error-rate / latency), so any
-  Prometheus-Operator platform alerts on eoAPI without bespoke wiring. *(An opt-in
-  `ServiceMonitor` is added later, once eoAPI exposes its own `/metrics` — see U1/U3 in
-  Upstream improvements; the first delivery needs none, because it reads ingress metrics the
-  platform already scrapes.)*
-- **Metrics** in Prometheus format (today: request rate / errors / latency at the ingress, per
-  service and route-prefix; app-level metrics are a later increment — see *Upstream improvements*).
-- **Logs** to **stdout** in a structured form, for the platform's log shipper to collect. eoAPI
-  does not push logs.
-- **Standards:** Prometheus exposition for metrics; OpenTelemetry (backend-neutral OTLP) is the
-  chosen standard for the later traces increment.
-
-**What the consuming platform provides (e.g. EOEPCA's Operations BB)**
-- A **Prometheus-Operator** stack (scrapes targets via `ServiceMonitor`/`PodMonitor`; today it
-  already scrapes the ingress controller's metrics), plus **Grafana** (with a dashboard sidecar)
-  and **Alertmanager**.
-- A **log pipeline** (shipper → Loki) that collects pod stdout.
-- **Retention, access control, and alert routing.**
-
-**Boundaries (explicit)**
-- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's.
-  The chart's bundled monitoring components stay disabled when a platform stack is present.
-- Metric label cardinality is bounded by design (no per-collection / per-tile labels in this
-  delivery) to respect platform budgets (EOEPCA: 30d / 50Gi).
+This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
+consumer.
+
+| Capability | Provided by | Detail |
+|---|---|---|
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template, method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only. |
+| **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
+| **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
+| **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
+| **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
+| **DB metrics** | **Platform** | postgres-exporter. |
+| **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
+| **Log pipeline** | **Platform** | Alloy → Loki. |
+
+**Boundaries**
+- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's; the
+  chart's bundled monitoring components stay disabled where a platform stack exists.
+- **Cardinality is bounded by design** — route template / method / status class only; **no full
+  URLs, no per-collection / per-tile labels** (the BB calls out high-cardinality churn explicitly;
+  EOEPCA has been bitten before). Per-collection analytics are out of scope (see roadmap).
 
 ### EOEPCA integration = a few config changes
-Because the capability lives in the chart, bringing it to EOEPCA is **values only** in
+Because the capability lives in the apps + chart, bringing it to EOEPCA is **values only** in
 `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
-1. **Enable** the chart's observability extension (dashboard ConfigMap + PrometheusRules).
-2. **Match the dashboard sidecar** label/namespace so EOEPCA's Grafana imports it
-   (`eoapi_dashboard` → confirm against the cluster's sidecar config).
-3. **Confirm the scrape target** (ingress metric source: nginx vs APISIX) and leave the chart's
-   bundled Prometheus **disabled** (the cluster stack is used instead).
+1. **Enable** native metrics (env) on `eoapi-stac` / `eoapi-stac-auth-proxy`.
+2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
+   sidecar label/namespace (`eoapi_dashboard`).
+3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
+   ServiceMonitor). Over time, point the existing STAC burn-rate records at the **native
+   request metrics** instead of the indirect gateway signal.
 
-No image rebuilds, no application changes — ArgoCD syncs the values and the dashboard/rules
-appear in the existing Grafana/Prometheus.
+No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.
 
-## First delivery — scope and effort (~15h)
+## First delivery — scope and effort (~15h, the STAC slice)
 
-| # | Task | Effort |
-|---|---|---|
-| 1 | Write this contract (the acceptance artifact) and circulate for Operations BB review | 3h |
-| 2 | Confirm on `develop` whether eoAPI is fronted by **nginx-ingress or APISIX**, that request-duration histograms are scraped, and capture exact metric names | 2h |
-| 3 | **In `eoapi-k8s` (generic):** build the **"eoAPI operations" Grafana dashboard** (request rate, error rate by status, latency p50/p95/p99 per service) from existing metrics; templated datasource; ship via the chart's existing `observability` ConfigMap mechanism | 4h |
-| 4 | **In `eoapi-k8s` (generic):** add opt-in **PrometheusRules** (availability / error-rate / latency) as chart values + template | 2h |
-| 5 | **EOEPCA integration = a few config changes** — values PR to `eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`): enable the extension, match the dashboard sidecar, confirm the scrape | 2h |
-| 6 | Finalize this doc + a short **#202 note**; hand off for Operations BB sign-off | 2h |
-| | **Total** | **15h** |
-
-No new container images, no application code changes, and no new backend services — that is what
-keeps this within 15h.
-
-## Out of scope (future roadmap, not this delivery)
-- OpenTelemetry auto-instrumentation (DB spans, internal latency).
-- Distributed **traces** and trace↔log correlation — also blocked by **Tempo not being deployed**
-  on EOEPCA today.
-- **Per-collection analytics** (the eoAPI#193 headline) — needs bounded custom code and a
-  cardinality budget agreed with the Operations BB.
-
-These are sketched in *Upstream improvements* below as costed, optional follow-ups.
-
-## Upstream improvements (do-it-well, if time allows)
-
-Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home for these is upstream — not
-a per-deployment overlay. These are **separate allocations**, not part of the 15h. Estimates are
-engineering hours (tests/docs/PR/release included); upstream PRs also carry review/CI/release
-latency beyond these hours.
-
-| # | Upstream item | Effort |
+The STAC Scenario is the BB's worked example, so the first increment targets it end-to-end.
+
+| # | Task | Where | Effort |
+|---|---|---|---|
+| 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template, method, status class) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
+| 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
+| | **Total** | | **15h** |
+
+**Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
+release cadence**, not the engineering hours. If a release is slow, tasks 1, 4 and the
+clarification (the #202 AC) can land first while the `/metrics` PRs go through.
+
+## EOEPCA enablement (values only)
+A follow-on values PR to `eoepca-plus` enables metrics + ServiceMonitor + dashboard and confirms
+the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/chart work.)
+
+## Out of scope (future roadmap)
+- **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
+  vector (tipg), multidim.
+- **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
+- **Per-collection analytics** (eoAPI#193) — high cardinality; needs a bounded design and a
+  budget agreed with the Operations BB.
+
+## Upstream improvements (do-it-well, beyond the first delivery)
+
+Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home is upstream. Estimates are
+engineering hours (tests/docs included); upstream PRs also carry review/CI/release latency.
+
+| # | Item | Effort |
 |---|---|---|
-| U1 | **eoapi-k8s** — opt-in `observability` extension: `ServiceMonitor` template + values toggle, and the ops panels folded into the bundled dashboard | 6–9h |
-| U2 | **eoapi-k8s** — shared `OTEL_*` env passthrough scaffolding (a `telemetry` block) | 3–5h |
-| U3 | **eoAPI apps** (titiler-pgstac / stac-fastapi-pgstac / tipg) — opt-in native Prometheus `/metrics` endpoint, off by default (app-level latency/error + DB-pool metrics) | 4–6h per app (~12–18h all three) |
-| U4 | **eoAPI apps** — opt-in OpenTelemetry baked into images (env-driven, off by default); riskiest (dependency-conflict + multi-worker validation); only worthwhile once traces are wanted and Tempo exists | 5–7h per app (~15–20h all three) |
-| U5 | **eoAPI apps** — bounded, opt-in per-collection metric (#193) with cardinality guards + dashboard panel | 8–14h |
-
-**Suggested packaging:**
-- *Phase A (~10–15h)* — U1 + U3 (one app): make app-level metrics first-class in the chart and
-  one app. A natural second ~15h slot after this delivery.
-- *Phase B (~12–18h)* — U3 across the remaining apps.
-- *Phase C (~20–25h)* — U2 + U4 (traces), gated on Tempo being deployed on EOEPCA.
-- *Phase D (~8–14h)* — U5 (per-collection), gated on a cardinality budget agreed with Ops BB.
+| U1 | Extend native `/metrics` to **raster / vector / multidim** (same pattern as STAC) | 4–6h per app |
+| U2 | First-class `eoapi-k8s` **`observability`/`telemetry` values block** (toggle metrics + ServiceMonitor + dashboard + standard `OTEL_*` env passthrough) | 6–9h |
+| U3 | Opt-in **OpenTelemetry / traces** baked into images (env-driven, off by default); riskiest (dependency-conflict + uvicorn multi-worker `WEB_CONCURRENCY` validation); only once traces are wanted **and** Tempo exists | 5–7h per app |
+| U4 | Bounded, opt-in **per-collection metric** (#193) with cardinality guards + dashboard panel | 8–14h |
 
 ## Verification / acceptance
-- **AC met:** the contract above is reviewed and accepted by the Operations BB — the single #202
-  checkbox.
-- **Demonstration:** on `develop`, eoAPI service health (request rate / error rate / latency
-  percentiles per service) is visible in the **existing** Grafana via the shipped dashboard; the
-  alert rules load; logs are queryable in the **existing** Loki — with no new images and no app
-  changes.
-- **Metric-source check (task 2)** confirms nginx vs APISIX and that duration histograms are
-  scraped before the dashboard queries are finalized.
+- **AC met:** the contract is reviewed and accepted by the Operations BB — the #202 checkbox.
+- **Gap closed:** `GET /metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy` returns **200**
+  (not 404), exposing bounded, route-templated series.
+- **Scraped:** the `ServiceMonitor` is picked up by the cluster Prometheus
+  (`kubectl get servicemonitor -n data-access`; series present).
+- **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
+  existing Grafana via the shipped ConfigMap.
+- **SLO:** native request records can feed the existing **99% < 500ms** burn-rate rules
+  (`stac_get_latency_500ms_*`), replacing the indirect gateway signal.
+- **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
+- **Cardinality check:** no full-URL / per-collection labels on any new series.

From f4618ffc9cfa69c7f8af6e8d1576cd64c1fd4229 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:47:56 +0100
Subject: [PATCH 5/8] docs: fold in Operations BB demo insights
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From the Ops BB (Versioneer/ESA) STAC SLO demo:
- The missing signal is operation-level latency (search vs item-listing
  vs collection-listing); URL-as-label is the cardinality trap they
  backed off — native route-template metrics solve it safely.
- POST /search is the latency-critical path the SLO alert targets.
- Agreed minimum bar: enable the framework's out-of-the-box metrics.
- Alerting is SLO/user-impact-centric, not CPU/memory.
- Keep enrichment correlates gateway/app/DB latency; native app-layer
  latency makes the "is it the app?" branch accurate.
- Per-collection (e.g. only VHR slow) is a real but deeper, deferred need.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 29 +++++++++++++++++++---
 1 file changed, 25 insertions(+), 4 deletions(-)
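
Reviewer note (illustrative, not part of the patch): a minimal sketch of why route-template labels stay bounded where URL-as-label does not. The template list and the helper name `metric_labels` are hypothetical; the real apps would take the template from the framework's router rather than from regexes.

```python
import re

# Hypothetical STAC route templates, listed here for illustration only.
ROUTE_TEMPLATES = [
    "/search",
    "/collections",
    "/collections/{collection_id}",
    "/collections/{collection_id}/items",
    "/collections/{collection_id}/items/{item_id}",
]


def _compile(template):
    # Each "{placeholder}" matches exactly one path segment.
    pattern = re.sub(r"\{[^/}]+\}", "[^/]+", template)
    return re.compile("^" + pattern + "/?$")


_COMPILED = [(template, _compile(template)) for template in ROUTE_TEMPLATES]


def metric_labels(method, path, status_code):
    """Return the bounded label set for one request: route template, method, status class."""
    route = next((t for t, rx in _COMPILED if rx.match(path)), "unmatched")
    return {
        "route": route,
        "method": method.upper(),
        "status_class": f"{status_code // 100}xx",
    }


# Two different item URLs collapse into the same label set.
print(metric_labels("GET", "/collections/sentinel-2/items/A1", 200))
print(metric_labels("GET", "/collections/landsat/items/B2", 200))
```

Under these assumptions the number of series grows with templates x methods x status classes, not with the number of collections or items.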

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index 3c54e4e..d73613d 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -56,6 +56,26 @@ is no tracing backend (no Tempo).** What already exists:
 So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
 metrics**. That is what this delivery contributes.
 
+### Design implications from the Operations BB demo
+The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow. Points that shape this design:
+
+- **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
+  request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
+  sees GET/POST, and putting the **URL in a label is a cardinality problem they tried and backed
+  off**. Native metrics keyed by **route template (= operation)** close this safely.
+- **POST `/search` is the latency-critical path** ("most of the stuff that takes longer is a
+  POST"); the SLO burn-rate alert is on STAC POST latency. Instrument and surface it prominently.
+- **Minimum bar = "what the framework gives for free."** ESA and the BB owner agreed the baseline
+  is enabling the out-of-the-box FastAPI route/method/latency-bucket metrics (a library + little
+  code) — exactly the opt-in `/metrics` proposed here.
+- **Alerting is SLO / user-impact-centric, not CPU/memory.** Per-workload CPU/mem already comes
+  for free and is *not* alerted on; eoAPI's contribution is the RED signals behind the SLO and the
+  dashboard the alert links to.
+- **Keep enrichment correlates gateway vs app vs DB latency** to localize the bottleneck (e.g.
+  "DB mean 11 ms → not the DB; app burn 2.2 → app fine"). Native **app-layer** latency makes the
+  "is it the app?" branch accurate instead of inferred from APISIX upstream latency. eoAPI's
+  metrics/rules should carry **stable labels** so Keep can pull and correlate them.
+
 ## The contract — what eoAPI provides vs. what the platform provides
 
 This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
@@ -63,7 +83,7 @@ consumer.
 
 | Capability | Provided by | Detail |
 |---|---|---|
-| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template, method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only. |
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
 | **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
 | **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
 | **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
@@ -98,7 +118,7 @@ The STAC Scenario is the BB's worked example, so the first increment targets it
 | # | Task | Where | Effort |
 |---|---|---|---|
 | 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
-| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template, method, status class) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
 | 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
 | 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
 | 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
@@ -116,8 +136,9 @@ the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/char
 - **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
   vector (tipg), multidim.
 - **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
-- **Per-collection analytics** (eoAPI#193) — high cardinality; needs a bounded design and a
-  budget agreed with the Operations BB.
+- **Per-collection analytics** (eoAPI#193) — a real root-cause need (the Ops BB demo's example:
+  "only the VHR data is slow"), but high cardinality and needs service-specific knowledge + a
+  bounded design and budget agreed with the Operations BB.
 
 ## Upstream improvements (do-it-well, beyond the first delivery)
 

From 578ca2a0917b2e261b4914ff36ab80f29e45f586 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:52:01 +0100
Subject: [PATCH 6/8] docs: phrase observability design as a standalone first
 version

Drop revision-style framing so the document reads as an original
proposal (no implied earlier published version).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index d73613d..fea539f 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -42,9 +42,9 @@ is no tracing backend (no Tempo).** What already exists:
 | **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
 | **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis |
 
-> **Important correction:** EOEPCA's ingress is **APISIX, not nginx-ingress.** The `eoapi-k8s`
-> chart's autoscaling/dashboard use `nginx_ingress_controller_*` metrics, which **do not exist**
-> on EOEPCA. Any eoAPI-shipped dashboard must therefore use APISIX (or native app) metrics.
+> **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
+> chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
+> present** on EOEPCA, so any eoAPI-shipped dashboard uses APISIX (or native app) metrics.
 
 ### The gap the Operations BB names
 - `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.

From baeb4e62e501170d528cfbb6a9d7db940d2ad3cf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 15:55:17 +0100
Subject: [PATCH 7/8] docs: link the EOEPCA+ demo (8 May 2026) in the
 observability design

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index fea539f..39e9ec0 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -57,7 +57,9 @@ So the gateway/DB/log/SLO machinery already exists — **the missing piece is na
 metrics**. That is what this delivery contributes.
 
 ### Design implications from the Operations BB demo
-The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow. Points that shape this design:
+The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link).
+Points that shape this design:
 
 - **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
   request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only

From 96f0a28e4704519db18adb8484f3b760debc5a0f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Houpert?=
 <10154151+lhoupert@users.noreply.github.com>
Date: Thu, 4 Jun 2026 16:05:25 +0100
Subject: [PATCH 8/8] docs: tighten observability design after review

- Clarify GET vs POST burn-rate records; POST /search is the critical path.
- Note /metrics is cluster-internal only (port 8080), not via public APISIX
  ingress (matters for the auth proxy).
- Make burn-rate rebasing the Ops BB's decision; eoAPI only exposes signals.
- Soften the logs row (collected as-is; structured logging is later).
- Add deployed-name <-> upstream-package mapping and cross-repo dependency
  order; flag the demo Drive link as access-restricted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/design/observability-operations-bb.md | 35 ++++++++++++++--------
 1 file changed, 23 insertions(+), 12 deletions(-)
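
Reviewer note (illustrative, not part of the patch): a worked sketch of the burn-rate arithmetic behind the `> 14.4` threshold, assuming the common definition (bad-request fraction divided by the error budget) and a 30-day SLO period. The Ops BB's actual recording rules are not shown in this document, so the function names and the 30-day period are assumptions.

```python
def burn_rate(slow_requests, total_requests, slo_target=0.99):
    """Observed fraction of requests over 500 ms, divided by the error budget (1 - SLO)."""
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (slow_requests / total_requests) / error_budget


def budget_fraction_consumed(rate, window_hours, period_hours=30 * 24):
    """Share of the whole period's error budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / period_hours


# 144 slow requests out of 1000 in the last hour, against a 99% target.
rate = burn_rate(slow_requests=144, total_requests=1000)
print(round(rate, 2))
print(round(budget_fraction_consumed(rate, window_hours=1), 4))
```

With these numbers the rate is about 14.4, which over one hour spends roughly 2% of a 30-day budget. The same calculation would apply whether the slow/total counts come from the gateway or from the native app histograms.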

diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
index 39e9ec0..6e3c6d9 100644
--- a/docs/design/observability-operations-bb.md
+++ b/docs/design/observability-operations-bb.md
@@ -32,7 +32,12 @@ reusable and low-maintenance for every eoAPI deployment.
 From the Operations BB docs, the request path is
 `client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
 is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
-is no tracing backend (no Tempo).** What already exists:
+is no tracing backend (no Tempo).**
+
+> Deployed-service ↔ upstream-package names: **`eoapi-stac`** = `stac-fastapi-pgstac`,
+> **`eoapi-stac-auth-proxy`** = `stac-auth-proxy` (both FastAPI/Starlette apps we maintain).
+
+What already exists:
 
 | Signal | Source today | Notes |
 |---|---|---|
@@ -40,7 +45,7 @@ is no tracing backend (no Tempo).** What already exists:
 | **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
 | **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
 | **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
-| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records (e.g. `stac_get_latency_500ms_burn_rate_1h`, critical at `> 14.4`); GET/POST request records are the alert basis, with gateway/DB as diagnosis |
+| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records exist for **both GET and POST** (`stac_get_latency_500ms_burn_rate_1h` / `stac_post_latency_500ms_burn_rate_1h`, critical at `> 14.4`). **POST `/search` is the latency-critical path**; gateway/DB records serve as diagnosis |
 
 > **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
 > chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
@@ -57,9 +62,9 @@ So the gateway/DB/log/SLO machinery already exists — **the missing piece is na
 metrics**. That is what this delivery contributes.
 
 ### Design implications from the Operations BB demo
-The Operations BB team (Versioneer / ESA) demoed the STAC SLO workflow at the
-[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link).
-Points that shape this design:
+The Operations BB team demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link)
+(the recording sits on an access-restricted Google Drive). Points that shape this design:
 
 - **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
   request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
@@ -85,10 +90,10 @@ consumer.
 
 | Capability | Provided by | Detail |
 |---|---|---|
-| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`: request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`, on the **existing app port (8080), cluster-internal only** (scraped by the ServiceMonitor; **not** routed through the public APISIX ingress). Request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
 | **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
 | **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
-| **Structured logs** | **eoAPI** | Logs to **stdout** for the platform's shipper; eoAPI does not push logs. |
+| **Logs** | **eoAPI** | Logs to **stdout**, collected **as-is** by the platform's shipper (Alloy→Loki); eoAPI does not push logs. Structured/JSON logging is a later increment, not part of this delivery. |
 | **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
 | **DB metrics** | **Platform** | postgres-exporter. |
 | **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
@@ -108,8 +113,8 @@ Because the capability lives in the apps + chart, bringing it to EOEPCA is **val
 2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
    sidecar label/namespace (`eoapi_dashboard`).
 3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
-   ServiceMonitor). Over time, point the existing STAC burn-rate records at the **native
-   request metrics** instead of the indirect gateway signal.
+   ServiceMonitor). The Ops BB may then **choose** to rebase its STAC burn-rate records onto the
+   native request metrics — that is the BB's decision; eoAPI only exposes the signal.
 
 No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.
 
@@ -123,9 +128,14 @@ The STAC Scenario is the BB's worked example, so the first increment targets it
 | 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
 | 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
 | 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
-| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how native records feed the existing 99%/500ms burn-rate rules) | docs | 1h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how the native series can back the existing 99%/500ms SLO — the BB owns whether to rebase its rules) | docs | 1h |
 | | **Total** | | **15h** |
 
+**Dependency order (critical path):** app `/metrics` released (tasks 2–3, repos
+`stac-fastapi-pgstac` + `stac-auth-proxy`) → chart `ServiceMonitor` + dashboard (task 4,
+`eoapi-k8s`) → EOEPCA values enablement (`eoepca-plus`). The clarification (task 1) and the
+chart work (task 4) can be prepared in parallel with the app PRs (tasks 2–3); only end-to-end verification needs the released apps.
+
 **Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
 release cadence**, not the engineering hours. If a release is slow, tasks 1, 4 and the
 clarification (the #202 AC) can land first while the `/metrics` PRs go through.
@@ -162,7 +172,8 @@ engineering hours (tests/docs included); upstream PRs also carry review/CI/relea
   (`kubectl get servicemonitor -n data-access`; series present).
 - **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
   existing Grafana via the shipped ConfigMap.
-- **SLO:** native request records can feed the existing **99% < 500ms** burn-rate rules
-  (`stac_get_latency_500ms_*`), replacing the indirect gateway signal.
+- **SLO:** the native request series are suitable to back the existing **99% < 500ms** SLO
+  (`stac_get_latency_500ms_*` / `stac_post_latency_500ms_*`); whether the Ops BB rebases its
+  burn-rate rules onto them, instead of the indirect gateway signal, is the BB's decision.
 - **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
 - **Cardinality check:** no full-URL / per-collection labels on any new series.