docs: eoAPI Monitoring & Logging — Operations BB integration (#202) #217

Draft
lhoupert wants to merge 8 commits into main from docs/observability-operations-bb

Conversation

lhoupert (Collaborator) commented Jun 4, 2026

Summary

Adds a focused design doc — docs/design/observability-operations-bb.md — for the first increment of monitoring & logging integration with the EOEPCA Operations BB (#202). Registered under Design in the nav.

It is deliberately narrow and eoAPI-first: the work lands in the apps and the eoapi-k8s chart we maintain as a generic, opt-in capability, and EOEPCA is reached with a few Helm-values changes.

What it proposes

Aligned with the Operations BB docs (STAC Scenario, ServiceMonitors, Alerting & SLOs) and the EOEPCA+ demo (8 May 2026):

  • The BB already has APISIX gateway metrics, postgres-exporter, Alloy→Loki logs, synthetic checks, and an SLO/burn-rate + Keep workflow. The named gap is native application metrics: /metrics returns 404 on eoapi-stac and eoapi-stac-auth-proxy, and there is no data-access ServiceMonitor.
  • First delivery (~15h, STAC slice): an opt-in, bounded, route-templated /metrics endpoint on eoapi-stac and eoapi-stac-auth-proxy, labelled by operation (search / item-listing / collection-listing, the signal APISIX can't give safely); an opt-in ServiceMonitor; and dashboard/alert panels. Together these feed the existing 99% < 500ms SLO.
  • Bounded cardinality (no URL/per-collection labels); cluster-internal /metrics; logs unchanged (Alloy already collects them).
  • Out of scope / roadmap: other services, tracing (no Tempo deployed), per-collection analytics.
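To make the bounded-cardinality idea concrete, here is a minimal sketch of how a route template could be mapped to the three operation labels named above. The route templates, the `OPERATIONS` table, the `operation_label` helper, and the `"other"` fallback are all assumptions for illustration; the design doc, not this snippet, defines the actual metric and label shape.

```python
from typing import Optional

# Hypothetical mapping from (HTTP method, route template) to a fixed label.
# Keys are route templates, never concrete URLs, so collection IDs and
# query strings cannot leak into label values.
OPERATIONS = {
    ("GET", "/search"): "search",
    ("POST", "/search"): "search",
    ("GET", "/collections"): "collection-listing",
    ("GET", "/collections/{collection_id}/items"): "item-listing",
}


def operation_label(method: str, route_template: Optional[str]) -> str:
    """Return a label drawn from a small, fixed set of operations.

    Anything unmatched (including requests that matched no route at all)
    collapses into "other", which keeps the label set bounded.
    """
    if route_template is None:
        return "other"
    return OPERATIONS.get((method.upper(), route_template), "other")
```

Because the lookup is keyed on the template, a request for any individual collection's items yields the same `item-listing` label, and the total number of distinct label values can never exceed the size of the table plus one.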

Acceptance criterion (#202)

The doc itself is the clarification of the eoAPI ↔ Operations BB monitoring/logging dependency — the single AC on #202. Implementation (app /metrics + chart ServiceMonitor + EOEPCA values) is the follow-on, with the critical path and cross-repo dependencies documented.

Asking reviewers

Feedback welcome on the contract (eoAPI vs platform split), the metric/label shape, and the STAC-slice scope. Flagging @achtsnits / @pantierra.

Relates to #202 and developmentseed/eoAPI#193.

🤖 Generated with Claude Code

lhoupert and others added 8 commits June 4, 2026 15:10
…ccess#202)

Add a narrow, ~15h-scoped first-delivery design for #202: clarify the
eoAPI ↔ Operations BB monitoring/logging contract and make eoAPI's
existing metrics part of the Operations BB by reusing what already
exists (ingress metrics already scraped by kube-prometheus-stack, logs
already shipped by Alloy, the chart's existing dashboard ConfigMap
mechanism) — no new images, app changes, or backends.

Includes the integration contract, a 15h task breakdown, an explicit
out-of-scope/roadmap pointer, and costed upstream-improvement proposals
for eoapi-k8s and the eoAPI apps. Registered under Design in the nav.
Leaves the broader observability roadmap doc untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Avoid confusion: the broader roadmap doc was never published, so drop the
cross-links and keep this design self-contained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Center the deliverable on eoAPI (the eoapi-k8s chart) as a generic,
opt-in capability for any Prometheus-Operator + Grafana platform, with
EOEPCA reached via a few Helm-values changes rather than bespoke code.

- Add a "Provided by" column splitting what eoAPI ships vs what the
  platform provides (the heart of the contract).
- Add an explicit "EOEPCA integration = a few config changes" section.
- Scope the ServiceMonitor to the later app-/metrics increment (U1/U3):
  the 15h delivery reads ingress metrics already scraped, so it needs
  only dashboard panels + alert rules in the chart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewed against the Operations BB docs (STAC Scenario, ServiceMonitors,
Alerting & SLOs) and corrected the design:

- Ingress is APISIX, not nginx — the chart's nginx_ingress_controller_*
  metrics/autoscaling don't exist on EOEPCA; document APISIX
  (apisix_http_status / apisix_http_latency) + postgres-exporter + Keep.
- The BB already has gateway/DB/log/SLO machinery; the named gap is
  native app metrics (/metrics returns 404 on eoapi-stac and
  stac-auth-proxy; no data-access ServiceMonitor).
- Promote native, bounded /metrics + ServiceMonitor + dashboard/alert
  panels (the STAC slice) into the first delivery; this matches the BB's
  stated desired end state and feeds the existing 99%/500ms burn-rate SLO.
- Keep bounded-cardinality, Alloy logs, platform ownership; defer traces
  (no Tempo) and per-collection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
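For readers unfamiliar with burn-rate SLOs, the arithmetic behind a 99%/500ms target can be sketched as follows. This uses the common definition (observed bad-event ratio divided by the error budget); how the Operations BB actually records and windows its burn rate is not specified here, and the `burn_rate` function and its parameters are hypothetical.

```python
def burn_rate(total: int, slow: int, slo_target: float = 0.99) -> float:
    """Burn rate of a latency SLO over one observation window.

    total: requests observed in the window.
    slow:  requests that exceeded the latency threshold (e.g. 500 ms).
    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO allows; higher values mean it is being spent faster.
    """
    if total == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    error_budget = 1.0 - slo_target  # 1% of requests for a 99% target
    return (slow / total) / error_budget
```

With a 99% target, a window in which 30 of 1000 requests are slower than the threshold gives a burn rate of about 3, i.e. the budget is being consumed roughly three times faster than is sustainable.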
From the Ops BB (Versioneer/ESA) STAC SLO demo:
- The missing signal is operation-level latency (search vs item-listing
  vs collection-listing); URL-as-label is the cardinality trap they
  backed off — native route-template metrics solve it safely.
- POST /search is the latency-critical path the SLO alert targets.
- Agreed minimum bar: enable the framework's out-of-the-box metrics.
- Alerting is SLO/user-impact-centric, not CPU/memory.
- Keep enrichment correlates gateway/app/DB latency; native app-layer
  latency makes the "is it the app?" branch accurate.
- Per-collection (e.g. only VHR slow) is a real but deeper, deferred need.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop revision-style framing so the document reads as an original
proposal (no implied earlier published version).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Clarify GET vs POST burn-rate records; POST /search is the critical path.
- Note /metrics is cluster-internal only (port 8080), not via public APISIX
  ingress (matters for the auth proxy).
- Make burn-rate rebasing the Ops BB's decision; eoAPI only exposes signals.
- Soften the logs row (collected as-is; structured logging is later).
- Add deployed-name <-> upstream-package mapping and cross-repo dependency
  order; flag the demo Drive link as access-restricted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>