EOEPCA · lhoupert · Jun 4, 2026
diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
@@ -0,0 +1,179 @@
+# eoAPI Monitoring & Logging (Operations BB integration)
+
+> **Status: Draft for discussion.** Focused first-delivery design for
+> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), aligned with the
+> [Operations BB documentation](https://eoepca.readthedocs.io/projects/operations/en/latest/)
+> (in particular its *STAC Scenario*, *ServiceMonitors*, and *Alerting and SLOs* pages).
+> Scoped to a single increment (~15h). Tracing and per-collection analytics are future roadmap.
+
+## Goal & acceptance criterion
+
+data-access#202 asks for one thing:
+
+> *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging
+> capabilities of eoAPI."*
+
+The Operations BB's **STAC Scenario** already states the concrete expectation, so the
+"clarification" is really *"adopt what the Operations BB already specifies and close the gap it
+names."* This delivery therefore produces:
+
+1. A written **contract** — what eoAPI exposes vs. what the Operations BB provides.
+2. The **first concrete step** of the BB's stated desired end state: native, bounded application
+   metrics from `eoapi-stac` and `eoapi-stac-auth-proxy`, a `ServiceMonitor`, and dashboard/alert
+   panels built on them.
+
+**Delivered in eoAPI itself** — the apps (`stac-fastapi-pgstac`, `stac-auth-proxy`) and the
+`eoapi-k8s` chart we maintain — as a **generic, opt-in capability**. **EOEPCA is then reached
+with a few Helm-values changes.** Doing it upstream (not as an EOEPCA-only overlay) keeps it
+reusable and low-maintenance for every eoAPI deployment.
+
+## How the Operations BB monitors eoAPI today
+
+From the Operations BB docs, the request path is
+`client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
+is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
+is no tracing backend (no Tempo).**
+
+> Deployed-service ↔ upstream-package names: **`eoapi-stac`** = `stac-fastapi-pgstac`,
+> **`eoapi-stac-auth-proxy`** = `stac-auth-proxy` (both FastAPI/Starlette apps we maintain).
+
+What already exists:
+
+| Signal | Source today | Notes |
+|---|---|---|
+| **Gateway metrics** | **APISIX** `/apisix/prometheus/metrics`, scraped by a `ServiceMonitor` in `ingress-apisix` | `apisix_http_status` (by code/route/method) and `apisix_http_latency` (`type=request\|upstream\|apisix`) — bounded labels; three latency views |
+| **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
+| **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
+| **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
+| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records exist for **both GET and POST** (`stac_get_latency_500ms_burn_rate_1h` / `stac_post_latency_500ms_burn_rate_1h`, critical at `> 14.4`). **POST `/search` is the latency-critical path**; gateway/DB records serve as diagnosis |
+
+> **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
+> chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
+> present** on EOEPCA, so any eoAPI-shipped dashboard uses APISIX (or native app) metrics.
+
+### The gap the Operations BB names
+- `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.
+- There are **no `ServiceMonitor` objects in the `data-access` namespace**.
+- Alerts/SLOs rely on **indirect gateway signals**; the BB explicitly wants *"semantic,
+  low-cardinality metrics from the application itself, then scrape them with a `ServiceMonitor`"*
+  to measure request rate, errors and latency **directly from the service**.
+
+So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
+metrics**. That is what this delivery contributes.
+
+### Design implications from the Operations BB demo
+The Operations BB team demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link)
+(the recording sits on an access-restricted Google Drive). Points that shape this design:
+
+- **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
+  request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
+  sees GET/POST, and putting the **URL in a label is a cardinality problem they tried and backed
+  off**. Native metrics keyed by **route template (= operation)** close this safely.
+- **POST `/search` is the latency-critical path** ("most of the stuff that takes longer is a
+  POST"); the SLO burn-rate alert is on STAC POST latency. Instrument and surface it prominently.
+- **Minimum bar = "what the framework gives for free."** ESA and the BB owner agreed the baseline
+  is enabling the out-of-the-box FastAPI route/method/latency-bucket metrics (a library plus a
+  little code), which is exactly the opt-in `/metrics` proposed here.
+- **Alerting is SLO / user-impact-centric, not CPU/memory.** Per-workload CPU/mem already comes
+  for free and is *not* alerted on; eoAPI's contribution is the RED signals behind the SLO and the
+  dashboard the alert links to.
+- **Keep enrichment correlates gateway vs app vs DB latency** to localize the bottleneck (e.g.
+  "DB mean 11 ms → not the DB; app burn 2.2 → app fine"). Native **app-layer** latency makes the
+  "is it the app?" branch accurate instead of inferred from APISIX upstream latency. eoAPI's
+  metrics/rules should carry **stable labels** so Keep can pull and correlate them.
+
+## The contract — what eoAPI provides vs. what the platform provides
+
+This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
+consumer.
+
+| Capability | Provided by | Detail |
+|---|---|---|
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`, on the **existing app port (8080), cluster-internal only** (scraped by the ServiceMonitor; **not** routed through the public APISIX ingress). Request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
+| **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
+| **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
+| **Logs** | **eoAPI** | Logs to **stdout**, collected **as-is** by the platform's shipper (Alloy→Loki); eoAPI does not push logs. Structured/JSON logging is a later increment, not part of this delivery. |
+| **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
+| **DB metrics** | **Platform** | postgres-exporter. |
+| **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
+| **Log pipeline** | **Platform** | Alloy → Loki. |
+
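+A minimal sketch of the opt-in `ServiceMonitor` row above, using the Prometheus-Operator
+`monitoring.coreos.com/v1` API. The object name, the `release` and `app.kubernetes.io/name`
+labels, the `http` port name and the scrape interval are illustrative assumptions, not the
+chart's actual values; the `data-access` namespace and the `/metrics` path come from this document.
+
+```yaml
+# Sketch only: every value marked "assumed" is hypothetical.
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: eoapi-stac                         # assumed object name
+  namespace: data-access
+  labels:
+    release: prometheus                    # assumed: the label the cluster Prometheus selects on
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: eoapi-stac   # assumed Service label
+  endpoints:
+    - port: http                           # assumed name of the existing app port (8080)
+      path: /metrics
+      interval: 30s                        # assumed scrape interval
+```
+
+A second object (or a broader selector) would cover `eoapi-stac-auth-proxy` in the same way.
+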
+**Boundaries**
+- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's; the
+  chart's bundled monitoring components stay disabled where a platform stack exists.
+- **Cardinality is bounded by design** — route template / method / status class only; **no full
+  URLs, no per-collection / per-tile labels** (the BB calls out high-cardinality churn explicitly;
+  EOEPCA has been bitten before). Per-collection analytics are out of scope (see roadmap).
+
+### EOEPCA integration = a few config changes
+Because the capability lives in the apps + chart, bringing it to EOEPCA is **values only** in
+`eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
+1. **Enable** native metrics (env) on `eoapi-stac` / `eoapi-stac-auth-proxy`.
+2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
+   sidecar label/namespace (`eoapi_dashboard`).
+3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
+   ServiceMonitor). The Ops BB may then **choose** to rebase its STAC burn-rate records onto the
+   native request metrics — that is the BB's decision; eoAPI only exposes the signal.
+
+No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.
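+
+A hedged sketch of how steps 1–3 might look in `values-eoapi.yaml`. Every key and env var name
+below is a hypothetical placeholder: the opt-in switches do not exist yet, and their final names
+will be set by the upstream app and chart PRs.
+
+```yaml
+# Hypothetical keys throughout; shown only to illustrate the "values only" shape.
+stac:
+  settings:
+    envVars:
+      ENABLE_METRICS: "true"      # step 1 (hypothetical env var name)
+stac-auth-proxy:
+  env:
+    ENABLE_METRICS: "true"        # step 1 (hypothetical env var name)
+observability:                    # hypothetical block (compare item U2 in this document)
+  serviceMonitor:
+    enabled: true                 # step 2
+  dashboard:
+    enabled: true                 # step 2
+    label: eoapi_dashboard        # sidecar label named in this document
+monitoring:
+  prometheus:
+    enabled: false                # step 3: bundled Prometheus stays off (hypothetical key)
+```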
+
+## First delivery — scope and effort (~15h, the STAC slice)
+
+The STAC Scenario is the BB's worked example, so the first increment targets it end-to-end.
+
+| # | Task | Where | Effort |
+|---|---|---|---|
+| 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
+| 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how the native series can back the existing 99%/500ms SLO — the BB owns whether to rebase its rules) | docs | 1h |
+| | **Total** | | **15h** |
+
+**Dependency order (critical path):** app `/metrics` released (tasks 2–3, repos
+`stac-fastapi-pgstac` + `stac-auth-proxy`) → chart `ServiceMonitor` + dashboard (task 4,
+`eoapi-k8s`) → EOEPCA values enablement (`eoepca-plus`). The contract (task 1) and the
+chart work (task 4) can proceed in parallel with tasks 2–3.
+
+**Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
+release cadence**, not the engineering hours. If a release is slow, tasks 1 and 4 (the contract,
+which is the #202 AC, and the chart work) can land first while the `/metrics` PRs go through.
+
+## EOEPCA enablement (values only)
+A follow-on values PR to `eoepca-plus` enables metrics + ServiceMonitor + dashboard and confirms
+the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/chart work.)
+
+## Out of scope (future roadmap)
+- **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
+  vector (tipg), multidim.
+- **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
+- **Per-collection analytics** (eoAPI#193) — a real root-cause need (the Ops BB demo's example:
+  "only the VHR data is slow"), but high cardinality and needs service-specific knowledge + a
+  bounded design and budget agreed with the Operations BB.
+
+## Upstream improvements (do-it-well, beyond the first delivery)
+
+Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home is upstream. Estimates are
+engineering hours (tests/docs included); upstream PRs also carry review/CI/release latency.
+
+| # | Item | Effort |
+|---|---|---|
+| U1 | Extend native `/metrics` to **raster / vector / multidim** (same pattern as STAC) | 4–6h per app |
+| U2 | First-class `eoapi-k8s` **`observability`/`telemetry` values block** (toggle metrics + ServiceMonitor + dashboard + standard `OTEL_*` env passthrough) | 6–9h |
+| U3 | Opt-in **OpenTelemetry / traces** baked into images (env-driven, off by default); the riskiest item (dependency conflicts, plus validation under uvicorn multi-worker `WEB_CONCURRENCY`); only worth doing once traces are wanted **and** Tempo exists | 5–7h per app |
+| U4 | Bounded, opt-in **per-collection metric** (eoAPI#193) with cardinality guards + dashboard panel | 8–14h |
+
+## Verification / acceptance
+- **AC met:** the contract is reviewed and accepted by the Operations BB — the #202 checkbox.
+- **Gap closed:** `GET /metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy` returns **200**
+  (not 404), exposing bounded, route-templated series.
+- **Scraped:** the `ServiceMonitor` is picked up by the cluster Prometheus
+  (`kubectl get servicemonitor -n data-access`; series present).
+- **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
+  existing Grafana via the shipped ConfigMap.
+- **SLO:** the native request series are suitable to back the existing **99% < 500ms** SLO
+  (`stac_get_latency_500ms_*` / `stac_post_latency_500ms_*`); whether the Ops BB rebases its
+  burn-rate rules onto them, instead of the indirect gateway signal, is the BB's decision.
+- **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
+- **Cardinality check:** no full-URL / per-collection labels on any new series.
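+
+To illustrate the **SLO** point above, a sketch of recording rules that derive the "requests
+under 500 ms" ratio and its burn rate from native series. It assumes the default metric name and
+labels of `prometheus-fastapi-instrumentator` (`http_request_duration_seconds` with `handler` and
+`method`) and a `0.5` bucket boundary; the object, group and record names and the `job` label are
+hypothetical, and the Operations BB decides whether any such rule is adopted.
+
+```yaml
+# Sketch only: names marked "hypothetical" are not existing Operations BB rules.
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: eoapi-stac-native-slo               # hypothetical
+  namespace: data-access
+spec:
+  groups:
+    - name: eoapi-stac-native.rules         # hypothetical
+      rules:
+        # Share of POST /search requests served in under 500 ms over the last hour.
+        - record: stac_native_post_search_latency_500ms_ratio_1h       # hypothetical
+          expr: |
+            sum(rate(http_request_duration_seconds_bucket{job="eoapi-stac",handler="/search",method="POST",le="0.5"}[1h]))
+            /
+            sum(rate(http_request_duration_seconds_count{job="eoapi-stac",handler="/search",method="POST"}[1h]))
+        # Burn rate against the 99% target: observed miss ratio divided by the 1% error budget.
+        - record: stac_native_post_search_latency_500ms_burn_rate_1h   # hypothetical
+          expr: (1 - stac_native_post_search_latency_500ms_ratio_1h) / (1 - 0.99)
+```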
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -10,6 +10,7 @@ nav:
     - Design:
       - design/overview.md
       - design/ogc-api-maps.md
+      - design/observability-operations-bb.md
     - API:
       - api/endpoint-specification.md
       - api/health-checks.md