diff --git a/docs/design/observability-operations-bb.md b/docs/design/observability-operations-bb.md
new file mode 100644
index 0000000..6e3c6d9
--- /dev/null
+++ b/docs/design/observability-operations-bb.md
@@ -0,0 +1,179 @@
+# eoAPI Monitoring & Logging (Operations BB integration)
+
+> **Status: Draft for discussion.** Focused first-delivery design for
+> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), aligned with the
+> [Operations BB documentation](https://eoepca.readthedocs.io/projects/operations/en/latest/)
+> (in particular its *STAC Scenario*, *ServiceMonitors*, and *Alerting and SLOs* pages).
+> Scoped to a single increment (~15h). Tracing and per-collection analytics are future roadmap.
+
+## Goal & acceptance criterion
+
+data-access#202 asks for one thing:
+
+> *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging
+> capabilities of eoAPI."*
+
+The Operations BB's **STAC Scenario** already states the concrete expectation, so the
+"clarification" is really *"adopt what the Operations BB already specifies and close the gap it
+names."* This delivery therefore produces:
+
+1. A written **contract** — what eoAPI exposes vs. what the Operations BB provides.
+2. The **first concrete step** of the BB's stated desired end state: native, bounded application
+   metrics from `eoapi-stac` and `eoapi-stac-auth-proxy`, a `ServiceMonitor`, and dashboard/alert
+   panels built on them.
+
+**Delivered in eoAPI itself** — the apps (`stac-fastapi-pgstac`, `stac-auth-proxy`) and the
+`eoapi-k8s` chart we maintain — as a **generic, opt-in capability**. **EOEPCA is then reached
+with a few Helm-values changes.** Doing it upstream (not as an EOEPCA-only overlay) keeps it
+reusable and low-maintenance for every eoAPI deployment.
+
+## How the Operations BB monitors eoAPI today
+
+From the Operations BB docs, the request path is
+`client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
+is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
+is no tracing backend (no Tempo).**
+
+> Deployed-service ↔ upstream-package names: **`eoapi-stac`** = `stac-fastapi-pgstac`,
+> **`eoapi-stac-auth-proxy`** = `stac-auth-proxy` (both FastAPI/Starlette apps we maintain).
+
+What already exists:
+
+| Signal | Source today | Notes |
+|---|---|---|
+| **Gateway metrics** | **APISIX** `/apisix/prometheus/metrics`, scraped by a `ServiceMonitor` in `ingress-apisix` | `apisix_http_status` (by code/route/method) and `apisix_http_latency` (`type=request\|upstream\|apisix`) — bounded labels; three latency views |
+| **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
+| **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
+| **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
+| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records exist for **both GET and POST** (`stac_get_latency_500ms_burn_rate_1h` / `stac_post_latency_500ms_burn_rate_1h`, critical at `> 14.4`). **POST `/search` is the latency-critical path**; gateway/DB records serve as diagnosis |
+
+> **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
+> chart's autoscaling/dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
+> present** on EOEPCA, so any eoAPI-shipped dashboard uses APISIX (or native app) metrics.
+
+### The gap the Operations BB names
+- `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.
+- There are **no `ServiceMonitor` objects in the `data-access` namespace**.
+- Alerts/SLOs rely on **indirect gateway signals**; the BB explicitly wants *"semantic,
+  low-cardinality metrics from the application itself, then scrape them with a `ServiceMonitor`"*
+  to measure request rate, errors and latency **directly from the service**.
+
+So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
+metrics**. That is what this delivery contributes.
+
+### Design implications from the Operations BB demo
+The Operations BB team demoed the STAC SLO workflow at the
+[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link)
+(recording is access-restricted Google Drive). Points that shape this design:
+
+- **Operation-level latency is the signal that's missing.** From outside, the BB can see a STAC
+  request is slow but **cannot tell search vs item-listing vs collection-listing** — APISIX only
+  sees GET/POST, and putting the **URL in a label is a cardinality problem they tried and backed
+  off**. Native metrics keyed by **route template (= operation)** close this safely.
+- **POST `/search` is the latency-critical path** ("most of the stuff that takes longer is a
+  POST"); the SLO burn-rate alert is on STAC POST latency. Instrument and surface it prominently.
+- **Minimum bar = "what the framework gives for free."** ESA and the BB owner agreed the baseline
+  is enabling the out-of-the-box FastAPI route/method/latency-bucket metrics (a library + little
+  code) — exactly the opt-in `/metrics` proposed here.
+- **Alerting is SLO / user-impact-centric, not CPU/memory.** Per-workload CPU/mem already comes
+  for free and is *not* alerted on; eoAPI's contribution is the RED signals behind the SLO and the
+  dashboard the alert links to.
+- **Keep enrichment correlates gateway vs app vs DB latency** to localize the bottleneck (e.g.
+  "DB mean 11 ms → not the DB; app burn 2.2 → app fine"). Native **app-layer** latency makes the
+  "is it the app?"
+  branch accurate instead of inferred from APISIX upstream latency. eoAPI's
+  metrics/rules should carry **stable labels** so Keep can pull and correlate them.
+
+## The contract — what eoAPI provides vs. what the platform provides
+
+This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
+consumer.
+
+| Capability | Provided by | Detail |
+|---|---|---|
+| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`, on the **existing app port (8080), cluster-internal only** (scraped by the ServiceMonitor; **not** routed through the public APISIX ingress). Request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
+| **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
+| **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
+| **Logs** | **eoAPI** | Logs to **stdout**, collected **as-is** by the platform's shipper (Alloy→Loki); eoAPI does not push logs. Structured/JSON logging is a later increment, not part of this delivery. |
+| **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
+| **DB metrics** | **Platform** | postgres-exporter. |
+| **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
+| **Log pipeline** | **Platform** | Alloy → Loki.
+
+**Boundaries**
+- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's; the
+  chart's bundled monitoring components stay disabled where a platform stack exists.
+- **Cardinality is bounded by design** — route template / method / status class only; **no full
+  URLs, no per-collection / per-tile labels** (the BB calls out high-cardinality churn explicitly;
+  EOEPCA has been bitten before). Per-collection analytics are out of scope (see roadmap).
+
+### EOEPCA integration = a few config changes
+Because the capability lives in the apps + chart, bringing it to EOEPCA is **values only** in
+`eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
+1. **Enable** native metrics (env) on `eoapi-stac` / `eoapi-stac-auth-proxy`.
+2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
+   sidecar label/namespace (`eoapi_dashboard`).
+3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
+   ServiceMonitor). The Ops BB may then **choose** to rebase its STAC burn-rate records onto the
+   native request metrics — that is the BB's decision; eoAPI only exposes the signal.
+
+No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.
+
+## First delivery — scope and effort (~15h, the STAC slice)
+
+The STAC Scenario is the BB's worked example, so the first increment targets it end-to-end.
+
+| # | Task | Where | Effort |
+|---|---|---|---|
+| 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
+| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
+| 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
+| 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
+| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how the native series can back the existing 99%/500ms SLO — the BB owns whether to rebase its rules) | docs | 1h |
+| | **Total** | | **15h** |
+
+**Dependency order (critical path):** app `/metrics` released (tasks 2–3, repos
+`stac-fastapi-pgstac` + `stac-auth-proxy`) → chart `ServiceMonitor` + dashboard (task 4,
+`eoapi-k8s`) → EOEPCA values enablement (`eoepca-plus`). The clarification (task 1) and the
+chart work (task 4) can proceed in parallel.
+
+**Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
+release cadence**, not the engineering hours. If a release is slow, task 1 (the clarification,
+i.e. the #202 AC) and task 4 can land first while the `/metrics` PRs go through.
+
+## EOEPCA enablement (values only)
+A follow-on values PR to `eoepca-plus` enables metrics + ServiceMonitor + dashboard and confirms
+the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/chart work.)
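The label scheme named in tasks 2 and 3 (route template, method, status class) can be sketched without any metrics library. The Python below is a minimal illustration only, not the implementation in `stac-fastapi-pgstac` or `stac-auth-proxy`: the route table, the `unmatched` fallback label, and every function name are assumptions made for this example. It shows why keying on the template keeps the number of series independent of how many collections or URLs exist.

```python
import re
from collections import Counter

# Hypothetical route table for illustration; a real app would read the
# templates from its framework router instead of a hand-written list.
ROUTE_TEMPLATES = [
    "/search",
    "/collections",
    "/collections/{collection_id}",
    "/collections/{collection_id}/items",
]


def _compile(template):
    """Turn '/collections/{collection_id}/items' into an anchored regex."""
    literal_parts = re.split(r"\{[^}]+\}", template)
    body = "[^/]+".join(re.escape(part) for part in literal_parts)
    return re.compile("^" + body + "/?$")


_COMPILED = [(template, _compile(template)) for template in ROUTE_TEMPLATES]


def route_template(path):
    """Map a concrete path to its template; unknown paths share one label."""
    for template, pattern in _COMPILED:
        if pattern.match(path):
            return template
    # Assumption: unrouted paths (scanners, typos) collapse into one value
    # so they cannot inflate cardinality.
    return "unmatched"


def status_class(status_code):
    """Collapse 200/204 into '2xx', 404 into '4xx', and so on."""
    return f"{status_code // 100}xx"


def bounded_labels(path, method, status_code):
    """Return the (route template, method, status class) label tuple."""
    return (route_template(path), method.upper(), status_class(status_code))


# Stand-in for a request counter; a real exporter would use a Prometheus counter.
requests_total = Counter()


def record(path, method, status_code):
    requests_total[bounded_labels(path, method, status_code)] += 1


# Two different collections and two different 2xx codes land in one series.
record("/collections/sentinel-2/items", "get", 200)
record("/collections/landsat-8/items", "GET", 204)
record("/search", "POST", 504)
record("/wp-login.php", "GET", 404)
```

With this scheme the upper bound on series is (number of templates + 1) × methods × status classes, whatever the traffic looks like. The counter here only demonstrates the bounding logic; duration histograms would reuse the same label tuple.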
+
+## Out of scope (future roadmap)
+- **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
+  vector (tipg), multidim.
+- **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
+- **Per-collection analytics** (eoAPI#193) — a real root-cause need (the Ops BB demo's example:
+  "only the VHR data is slow"), but high cardinality and needs service-specific knowledge + a
+  bounded design and budget agreed with the Operations BB.
+
+## Upstream improvements (do-it-well, beyond the first delivery)
+
+Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home is upstream. Estimates are
+engineering hours (tests/docs included); upstream PRs also carry review/CI/release latency.
+
+| # | Item | Effort |
+|---|---|---|
+| U1 | Extend native `/metrics` to **raster / vector / multidim** (same pattern as STAC) | 4–6h per app |
+| U2 | First-class `eoapi-k8s` **`observability`/`telemetry` values block** (toggle metrics + ServiceMonitor + dashboard + standard `OTEL_*` env passthrough) | 6–9h |
+| U3 | Opt-in **OpenTelemetry / traces** baked into images (env-driven, off by default); riskiest (dependency-conflict + uvicorn multi-worker `WEB_CONCURRENCY` validation); only once traces are wanted **and** Tempo exists | 5–7h per app |
+| U4 | Bounded, opt-in **per-collection metric** (#193) with cardinality guards + dashboard panel | 8–14h |
+
+## Verification / acceptance
+- **AC met:** the contract is reviewed and accepted by the Operations BB — the #202 checkbox.
+- **Gap closed:** `GET /metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy` returns **200**
+  (not 404), exposing bounded, route-templated series.
+- **Scraped:** the `ServiceMonitor` is picked up by the cluster Prometheus
+  (`kubectl get servicemonitor -n data-access`; series present).
+- **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
+  existing Grafana via the shipped ConfigMap.
+- **SLO:** the native request series are suitable to back the existing **99% < 500ms** SLO
+  (`stac_get_latency_500ms_*` / `stac_post_latency_500ms_*`); whether the Ops BB rebases its
+  burn-rate rules onto them, instead of the indirect gateway signal, is the BB's decision.
+- **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
+- **Cardinality check:** no full-URL / per-collection labels on any new series.
diff --git a/mkdocs.yml b/mkdocs.yml
index 3b48f60..5829e5d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -10,6 +10,7 @@ nav:
   - Design:
     - design/overview.md
     - design/ogc-api-maps.md
+    - design/observability-operations-bb.md
   - API:
     - api/endpoint-specification.md
     - api/health-checks.md