179 changes: 179 additions & 0 deletions docs/design/observability-operations-bb.md
@@ -0,0 +1,179 @@
# eoAPI Monitoring & Logging (Operations BB integration)

> **Status: Draft for discussion.** Focused first-delivery design for
> [EOEPCA/data-access#202](https://github.com/EOEPCA/data-access/issues/202), aligned with the
> [Operations BB documentation](https://eoepca.readthedocs.io/projects/operations/en/latest/)
> (in particular its *STAC Scenario*, *ServiceMonitors*, and *Alerting and SLOs* pages).
> Scoped to a single increment (~15h). Tracing and per-collection analytics are future roadmap.

## Goal & acceptance criterion

data-access#202 asks for one thing:

> *"Clarified dependencies and expectations of the Operations BB towards monitoring and logging
> capabilities of eoAPI."*

The Operations BB's **STAC Scenario** already states the concrete expectation, so the
"clarification" is really *"adopt what the Operations BB already specifies and close the gap it
names."* This delivery therefore produces:

1. A written **contract** — what eoAPI exposes vs. what the Operations BB provides.
2. The **first concrete step** of the BB's stated desired end state: native, bounded application
metrics from `eoapi-stac` and `eoapi-stac-auth-proxy`, a `ServiceMonitor`, and dashboard/alert
panels built on them.

**Delivered in eoAPI itself**: the work lands in the apps (`stac-fastapi-pgstac`, `stac-auth-proxy`) and the
`eoapi-k8s` chart we maintain, as a **generic, opt-in capability**. **Enabling it on EOEPCA then
takes only a few Helm-values changes.** Doing it upstream (not as an EOEPCA-only overlay) keeps it
reusable and low-maintenance for every eoAPI deployment.

## How the Operations BB monitors eoAPI today

From the Operations BB docs, the request path is
`client → APISIX route → eoapi-stac-auth-proxy → eoapi-stac → pgSTAC`, and the monitoring stack
is **Prometheus + Grafana + Alertmanager + Loki + Grafana Alloy + Keep** (alert triage). **There
is no tracing backend (no Tempo).**

> Deployed-service ↔ upstream-package names: **`eoapi-stac`** = `stac-fastapi-pgstac`,
> **`eoapi-stac-auth-proxy`** = `stac-auth-proxy` (both FastAPI/Starlette apps we maintain).

What already exists:

| Signal | Source today | Notes |
|---|---|---|
| **Gateway metrics** | **APISIX** `/apisix/prometheus/metrics`, scraped by a `ServiceMonitor` in `ingress-apisix` | `apisix_http_status` (by code/route/method) and `apisix_http_latency` (`type=request\|upstream\|apisix`) — bounded labels; three latency views |
| **DB metrics** | postgres-exporter | e.g. `avg(ccp_pg_stat_statements_total_mean_exec_time_ms{dbname="eoapi",role="eoapi"})` |
| **Logs** | **Alloy → Loki** | auto-collects pod stdout for the `data-access` workloads |
| **Synthetic checks** | external black-box probe | validates the public STAC endpoint |
| **SLO + alerts** | PrometheusRules + Alertmanager + Keep | STAC SLO **99% of requests < 500ms**; multi-window burn-rate records exist for **both GET and POST** (`stac_get_latency_500ms_burn_rate_1h` / `stac_post_latency_500ms_burn_rate_1h`, critical at `> 14.4`). **POST `/search` is the latency-critical path**; gateway/DB records serve as diagnosis |
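
As a worked illustration of the `> 14.4` threshold, the sketch below applies the standard SRE definition of burn rate: the observed fraction of SLO-violating requests divided by the error budget (1 minus the target). This is an assumption about how records such as `stac_post_latency_500ms_burn_rate_1h` are derived; the Operations BB's recording rules remain authoritative, and the function name and request counts are hypothetical.

```python
def latency_burn_rate(slow_requests: int, total_requests: int, slo_target: float = 0.99) -> float:
    """Burn rate = observed bad fraction / error budget.

    Assumption: a request counts as "bad" when it takes 500 ms or longer.
    """
    if total_requests == 0:
        return 0.0  # no traffic, so no error budget is consumed
    error_budget = 1.0 - slo_target
    bad_fraction = slow_requests / total_requests
    return bad_fraction / error_budget


# Hypothetical 1h window: 2000 POST /search requests, 300 of them slow.
rate = latency_burn_rate(slow_requests=300, total_requests=2000)
print(f"burn rate: {rate:.1f}")   # 15% bad against a 1% budget
print("critical:", rate > 14.4)
```

Under this definition, a burn rate of 14.4 at a 99% target means roughly 14.4% of requests in the window exceeded 500 ms.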

> **Note on ingress:** EOEPCA fronts eoAPI with **APISIX**, not nginx-ingress. The `eoapi-k8s`
> chart's autoscaling and dashboard rely on `nginx_ingress_controller_*` metrics, which are **not
> present** on EOEPCA, so any eoAPI-shipped dashboard must use APISIX (or native app) metrics instead.

### The gap the Operations BB names
- `/metrics` returns **404** on `eoapi-stac` and `eoapi-stac-auth-proxy`.
- There are **no `ServiceMonitor` objects in the `data-access` namespace**.
- Alerts/SLOs rely on **indirect gateway signals**; the BB explicitly wants *"semantic,
low-cardinality metrics from the application itself, then scrape them with a `ServiceMonitor`"*
to measure request rate, errors and latency **directly from the service**.

So the gateway/DB/log/SLO machinery already exists — **the missing piece is native application
metrics**. That is what this delivery contributes.

### Design implications from the Operations BB demo
The Operations BB team demoed the STAC SLO workflow at the
[EOEPCA+ demo (8 May 2026)](https://drive.google.com/drive/folders/1lvPqXoW1-fMYNZVvfJw3LPjwb38nt2BS?usp=drive_link)
(the recording is on an access-restricted Google Drive). Points that shape this design:

- **Operation-level latency is the signal that's missing.** From outside, the BB can see that a STAC
request is slow but **cannot tell search from item-listing from collection-listing**: APISIX only
sees GET/POST, and putting the **URL in a label is a cardinality problem they tried and backed
away from**. Native metrics keyed by **route template (= operation)** close this gap safely.
- **POST `/search` is the latency-critical path** ("most of the stuff that takes longer is a
POST"); the SLO burn-rate alert is on STAC POST latency. Instrument and surface it prominently.
- **Minimum bar = "what the framework gives for free."** ESA and the BB owner agreed the baseline
is enabling the out-of-the-box FastAPI route/method/latency-bucket metrics (a library plus a little
code), which is exactly the opt-in `/metrics` proposed here.
- **Alerting is SLO / user-impact-centric, not CPU/memory.** Per-workload CPU/mem already comes
for free and is *not* alerted on; eoAPI's contribution is the RED signals behind the SLO and the
dashboard the alert links to.
- **Keep enrichment correlates gateway vs app vs DB latency** to localize the bottleneck (e.g.
"DB mean 11 ms → not the DB; app burn 2.2 → app fine"). Native **app-layer** latency makes the
"is it the app?" branch accurate instead of inferred from APISIX upstream latency. eoAPI's
metrics/rules should carry **stable labels** so Keep can pull and correlate them.
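
To make "keyed by route template" concrete, here is a minimal standalone sketch that collapses concrete request paths into a small, fixed set of labels. The templates and the `route_label` helper are illustrative assumptions, not the apps' actual code; a framework-level implementation would typically take the template from the matched route rather than from regexes.

```python
import re

# Illustrative STAC route templates (hypothetical selection).
_ROUTE_TEMPLATES = [
    (re.compile(r"^/search/?$"), "/search"),
    (re.compile(r"^/collections/?$"), "/collections"),
    (re.compile(r"^/collections/[^/]+/?$"), "/collections/{collection_id}"),
    (re.compile(r"^/collections/[^/]+/items/?$"), "/collections/{collection_id}/items"),
    (
        re.compile(r"^/collections/[^/]+/items/[^/]+/?$"),
        "/collections/{collection_id}/items/{item_id}",
    ),
]


def route_label(path: str) -> str:
    """Collapse a concrete request path into a bounded route-template label."""
    for pattern, template in _ROUTE_TEMPLATES:
        if pattern.match(path):
            return template
    # Unknown paths share one label, so scanners and typos cannot inflate cardinality.
    return "other"


for path in ("/search", "/collections/sentinel-2-l2a/items", "/wp-login.php"):
    print(path, "->", route_label(path))
```

However many collections or items exist, the label can only take as many values as there are templates, plus `other`.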

## The contract — what eoAPI provides vs. what the platform provides

This is the acceptance artifact. Written for **eoAPI in general**; the Operations BB is the first
consumer.

| Capability | Provided by | Detail |
|---|---|---|
| **Native app metrics** | **eoAPI** | Opt-in Prometheus `/metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy`, on the **existing app port (8080), cluster-internal only** (scraped by the ServiceMonitor; **not** routed through the public APISIX ingress). Request **counters + duration histograms** keyed by **route template (= STAC operation: search / item-listing / collection-listing), method, status class** (+ proxy **auth-decision** / cache outcomes). Bounded labels only — no URL. |
| **`ServiceMonitor`** | **eoAPI** (`eoapi-k8s`) | Opt-in template with stable labels so any Prometheus-Operator platform discovers the endpoints. |
| **Dashboard + delivery** | **eoAPI** (`eoapi-k8s`) | Extend the bundled dashboard (ConfigMap mechanism, label `eoapi_dashboard`) with native rate/error/latency panels. |
| **Logs** | **eoAPI** | Logs to **stdout**, collected **as-is** by the platform's shipper (Alloy→Loki); eoAPI does not push logs. Structured/JSON logging is a later increment, not part of this delivery. |
| **Gateway metrics** | **Platform** | APISIX `apisix_http_*` (already scraped). |
| **DB metrics** | **Platform** | postgres-exporter. |
| **Metrics backend + alerting** | **Platform** | Prometheus-Operator, Grafana (dashboard sidecar), Alertmanager, **Keep**; owns **SLOs, burn-rate rules, retention, access, alert routing**. |
| **Log pipeline** | **Platform** | Alloy → Loki. |
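
The shape of the "counters + duration histograms" row can be sketched with an in-memory stand-in. This is an illustration only: a real implementation would use a Prometheus client library (for example `prometheus-fastapi-instrumentator`), and the class name, bucket bounds, and label order below are assumptions.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket bounds in seconds; 0.5 is included so the 500 ms SLO
# threshold falls exactly on a bucket edge.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, float("inf"))


def status_class(status_code: int) -> str:
    """Reduce a status code to its class, e.g. 200 -> '2xx', 404 -> '4xx'."""
    return f"{status_code // 100}xx"


class RequestMetrics:
    """In-memory stand-in for a labelled counter plus a duration histogram."""

    def __init__(self) -> None:
        self.requests_total = defaultdict(int)
        self.duration_buckets = defaultdict(lambda: [0] * len(BUCKETS))

    def observe(self, route: str, method: str, status_code: int, seconds: float) -> None:
        labels = (route, method, status_class(status_code))
        self.requests_total[labels] += 1
        # Increment the first bucket whose upper bound is >= the observed duration.
        self.duration_buckets[labels][bisect_left(BUCKETS, seconds)] += 1

    def fraction_within(self, labels: tuple, bound: float) -> float:
        """Share of observations at or below `bound`, which must be a bucket edge."""
        counts = self.duration_buckets.get(labels)
        if not counts or sum(counts) == 0:
            return 0.0  # nothing observed for this label set
        return sum(counts[: BUCKETS.index(bound) + 1]) / sum(counts)


metrics = RequestMetrics()
metrics.observe("/search", "POST", 200, 0.32)
metrics.observe("/search", "POST", 200, 0.81)
metrics.observe("/collections/{collection_id}/items", "GET", 404, 0.02)
print(metrics.fraction_within(("/search", "POST", "2xx"), 0.5))
```

Placing a bucket edge at 0.5 s is what lets a histogram answer the "fraction of requests under 500 ms" question directly, without interpolating between neighbouring buckets.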

**Boundaries**
- eoAPI does **not** run its own Prometheus/Grafana/Loki — it integrates with the platform's; the
chart's bundled monitoring components stay disabled where a platform stack exists.
- **Cardinality is bounded by design** — route template / method / status class only; **no full
URLs, no per-collection / per-tile labels** (the BB calls out high-cardinality churn explicitly;
EOEPCA has been bitten before). Per-collection analytics are out of scope (see roadmap).
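
A rough series budget shows why these three labels stay cheap. The arithmetic below is a sketch under assumed numbers (12 route templates, 2 methods, 5 status classes, 8 histogram buckets are all hypothetical) and ignores client-specific extras such as `_created` series.

```python
def series_upper_bound(routes: int, methods: int, status_classes: int, buckets: int) -> int:
    """Approximate upper bound on time series for one counter plus one histogram."""
    label_sets = routes * methods * status_classes
    counter_series = label_sets
    # Each histogram label set exposes one series per bucket (including +Inf)
    # plus the _sum and _count series.
    histogram_series = label_sets * (buckets + 2)
    return counter_series + histogram_series


# Hypothetical sizing for one service.
bounded = series_upper_bound(routes=12, methods=2, status_classes=5, buckets=8)
print("bounded label set:", bounded)

# Adding a per-collection label multiplies every label set by the collection count.
print("with 500 collections:", bounded * 500)
```

The bound is a worst case, since most route/method/status combinations never occur, but the multiplication in the last line is what a per-collection label would add on top.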

### EOEPCA integration = a few config changes
Because the capability lives in the apps + chart, bringing it to EOEPCA is **values only** in
`eoepca-plus` (`argocd/eoepca/data-access/parts/values/values-eoapi.yaml`):
1. **Enable** native metrics (env) on `eoapi-stac` / `eoapi-stac-auth-proxy`.
2. **Enable** the chart's `ServiceMonitor` + extended dashboard ConfigMap; match the Grafana
sidecar label/namespace (`eoapi_dashboard`).
3. Leave the chart's bundled Prometheus **disabled** (the cluster stack scrapes via the new
ServiceMonitor). The Ops BB may then **choose** to rebase its STAC burn-rate records onto the
native request metrics — that is the BB's decision; eoAPI only exposes the signal.

No image rebuilds beyond shipping the instrumented app version; ArgoCD syncs the values.

## First delivery — scope and effort (~15h, the STAC slice)

The STAC Scenario is the BB's worked example, so the first increment targets it end-to-end.

| # | Task | Where | Effort |
|---|---|---|---|
| 1 | Write the **contract** (this doc) and circulate for Operations BB review | docs | 3h |
| 2 | Add opt-in, **route-templated `/metrics`** to `eoapi-stac` (request count + duration histogram; labels: route template = operation [search/items/collections], method, status class; **surface POST `/search` latency**) — reuse `prometheus-fastapi-instrumentator` or equivalent, off by default | `stac-fastapi-pgstac` (app) | 4h |
| 3 | Add **proxy metrics** to `eoapi-stac-auth-proxy` (request/latency + **auth-decision** outcomes) — new middleware alongside the existing stack (e.g. next to `AddProcessTimeHeaderMiddleware`), off by default | `stac-auth-proxy` (app) | 4h |
| 4 | Add an opt-in **`ServiceMonitor`** (stable labels) + extend the bundled dashboard with native rate/error/latency panels | `eoapi-k8s` (chart) | 3h |
| 5 | Wire-up note + short **#202 note**; hand off for Operations BB sign-off (incl. how the native series can back the existing 99%/500ms SLO — the BB owns whether to rebase its rules) | docs | 1h |
| | **Total** | | **15h** |
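
The proxy middleware in task 3 could look roughly like the pure-ASGI sketch below. It is stdlib-only so it runs on its own; the class name, the in-memory stores, and the rule that derives the auth decision from 401/403 status codes are assumptions, not the existing `stac-auth-proxy` API. A real version would export through a Prometheus client and could label the decision from the proxy's own auth logic.

```python
import asyncio
import time
from collections import Counter, defaultdict


def auth_decision(status_code: int) -> str:
    """Assumption: the proxy signals auth failures with 401 or 403."""
    if status_code == 401:
        return "unauthenticated"
    if status_code == 403:
        return "forbidden"
    return "allowed"


class ProxyMetricsMiddleware:
    """Pure-ASGI middleware recording request outcomes and latency per method."""

    def __init__(self, app) -> None:
        self.app = app
        self.decisions = Counter()          # (method, decision) -> request count
        self.seconds = defaultdict(float)   # (method, decision) -> summed latency

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)  # pass lifespan/websocket through
            return

        status = {"code": 500}  # assumed when the app fails before responding

        async def send_wrapper(message) -> None:
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        started = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            key = (scope["method"], auth_decision(status["code"]))
            self.decisions[key] += 1
            self.seconds[key] += time.perf_counter() - started


async def _demo() -> ProxyMetricsMiddleware:
    async def denying_app(scope, receive, send):  # stand-in for the wrapped app
        await send({"type": "http.response.start", "status": 403, "headers": []})
        await send({"type": "http.response.body", "body": b""})

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass  # discard the response in this demo

    middleware = ProxyMetricsMiddleware(denying_app)
    await middleware({"type": "http", "method": "POST", "path": "/search"}, receive, send)
    return middleware


demo = asyncio.run(_demo())
print(dict(demo.decisions))
```

Wrapping `send` is the usual way for ASGI middleware to learn the response status, and the `finally` block ensures that failed requests are counted too.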

**Dependency order (critical path):** app `/metrics` released (tasks 2–3, repos
`stac-fastapi-pgstac` + `stac-auth-proxy`) → chart `ServiceMonitor` + dashboard (task 4,
`eoapi-k8s`) → EOEPCA values enablement (`eoepca-plus`). The contract (task 1) sits outside this
chain, and the chart work (task 4) can be developed in parallel with tasks 2–3; only its
end-to-end check against live series needs the released apps.

**Schedule note (honest):** tasks 2–3 are upstream app changes, so the real constraint is **app
release cadence**, not the engineering hours. If a release is slow, tasks 1 and 4 (including the
contract, which is the #202 AC) can land first while the `/metrics` PRs go through review.

## EOEPCA enablement (values only)
A follow-on values PR to `eoepca-plus` enables metrics + ServiceMonitor + dashboard and confirms
the scrape — no code, ArgoCD-synced. (Counted separately from the 15h app/chart work.)

## Out of scope (future roadmap)
- **Other services** — extend the same native-`/metrics` pattern to raster (titiler-pgstac),
vector (tipg), multidim.
- **Tracing** (OpenTelemetry/OTLP) — also blocked by **Tempo not being deployed** on EOEPCA.
- **Per-collection analytics** (eoAPI#193): a real root-cause need (the Ops BB demo's example:
"only the VHR data is slow"), but it is high-cardinality and needs service-specific knowledge plus
a bounded design and budget agreed with the Operations BB.

## Upstream improvements (do-it-well, beyond the first delivery)

Because we maintain eoAPI and eoapi-k8s, the lowest-maintenance home is upstream. Estimates are
engineering hours (tests/docs included); upstream PRs also carry review/CI/release latency.

| # | Item | Effort |
|---|---|---|
| U1 | Extend native `/metrics` to **raster / vector / multidim** (same pattern as STAC) | 4–6h per app |
| U2 | First-class `eoapi-k8s` **`observability`/`telemetry` values block** (toggle metrics + ServiceMonitor + dashboard + standard `OTEL_*` env passthrough) | 6–9h |
| U3 | Opt-in **OpenTelemetry / traces** baked into images (env-driven, off by default); the riskiest item (dependency conflicts plus validation under uvicorn multi-worker `WEB_CONCURRENCY`); only once traces are wanted **and** Tempo exists | 5–7h per app |
| U4 | Bounded, opt-in **per-collection metric** (#193) with cardinality guards + dashboard panel | 8–14h |

## Verification / acceptance
- **AC met:** the contract is reviewed and accepted by the Operations BB — the #202 checkbox.
- **Gap closed:** `GET /metrics` on `eoapi-stac` and `eoapi-stac-auth-proxy` returns **200**
(not 404), exposing bounded, route-templated series.
- **Scraped:** the `ServiceMonitor` is picked up by the cluster Prometheus
(`kubectl get servicemonitor -n data-access`; series present).
- **Dashboard:** native rate / error / latency-percentile per-service panels populate in the
existing Grafana via the shipped ConfigMap.
- **SLO:** the native request series are suitable to back the existing **99% < 500ms** SLO
(`stac_get_latency_500ms_*` / `stac_post_latency_500ms_*`); whether the Ops BB rebases its
burn-rate rules onto them, instead of the indirect gateway signal, is the BB's decision.
- **Logs:** still queryable in the existing Loki (unchanged; Alloy already collects them).
- **Cardinality check:** no full-URL / per-collection labels on any new series.
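
The cardinality check could be automated against the scraped text, for example as below. This is a sketch under assumptions: the deny-list of label names and the sample metric names are hypothetical, and the parsing covers only the simple `name{label="value"} number` lines of the Prometheus text exposition format.

```python
import re

# Hypothetical deny-list: label names that would make cardinality unbounded.
FORBIDDEN_LABELS = {"url", "path", "collection", "collection_id", "item_id", "tile"}

_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def cardinality_violations(exposition: str) -> list[str]:
    """Return one message per label that looks unbounded."""
    violations = []
    for line in exposition.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        for name, value in _LABEL.findall(line):
            # A '?' in a value suggests a raw URL with a query string leaked in.
            if name in FORBIDDEN_LABELS or "?" in value:
                violations.append(f"{name}={value!r} in: {line}")
    return violations


sample = """\
# TYPE http_requests_total counter
http_requests_total{route="/search",method="POST",status="2xx"} 42
http_requests_total{route="/collections/{collection_id}/items",method="GET",status="2xx"} 7
http_requests_total{path="/collections/vhr-2024/items?limit=10",method="GET",status="2xx"} 1
"""

for problem in cardinality_violations(sample):
    print("VIOLATION:", problem)
```

Run against the output of `GET /metrics`, an empty result is consistent with the bounded-label contract; it does not prove it, since the deny-list is only a heuristic.
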
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -10,6 +10,7 @@ nav:
- Design:
- design/overview.md
- design/ogc-api-maps.md
- design/observability-operations-bb.md
- API:
- api/endpoint-specification.md
- api/health-checks.md