docs: eoAPI Monitoring & Logging — Operations BB integration (#202)#217
Draft
lhoupert wants to merge 8 commits into
…ccess#202)
Add a narrow, ~15h-scoped first-delivery design for #202: clarify the eoAPI ↔ Operations BB monitoring/logging contract and make eoAPI's existing metrics part of the Operations BB by reusing what already exists (ingress metrics already scraped by kube-prometheus-stack, logs already shipped by Alloy, the chart's existing dashboard ConfigMap mechanism), with no new images, app changes, or backends.
Includes the integration contract, a 15h task breakdown, an explicit out-of-scope/roadmap pointer, and costed upstream-improvement proposals for eoapi-k8s and the eoAPI apps. Registered under Design in the nav. Leaves the broader observability roadmap doc untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Avoid confusion: the broader roadmap doc was never published, so drop the cross-links and keep this design self-contained.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Center the deliverable on eoAPI (the eoapi-k8s chart) as a generic, opt-in capability for any Prometheus-Operator + Grafana platform, with EOEPCA reached via a few Helm-values changes rather than bespoke code.
- Add a "Provided by" column splitting what eoAPI ships vs what the platform provides (the heart of the contract).
- Add an explicit "EOEPCA integration = a few config changes" section.
- Scope the ServiceMonitor to the later app-/metrics increment (U1/U3): the 15h delivery reads ingress metrics already scraped, so it needs only dashboard panels + alert rules in the chart.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewed against the Operations BB docs (STAC Scenario, ServiceMonitors, Alerting & SLOs) and corrected the design:
- Ingress is APISIX, not nginx: the chart's nginx_ingress_controller_* metrics/autoscaling don't exist on EOEPCA; document APISIX (apisix_http_status / apisix_http_latency) + postgres-exporter + Keep.
- The BB already has gateway/DB/log/SLO machinery; the named gap is native app metrics (/metrics returns 404 on eoapi-stac and stac-auth-proxy; no data-access ServiceMonitor).
- Promote native, bounded /metrics + ServiceMonitor + dashboard/alert panels (the STAC slice) into the first delivery; this matches the BB's stated desired end state and feeds the existing 99%/500ms burn-rate SLO.
- Keep bounded-cardinality, Alloy logs, platform ownership; defer traces (no Tempo) and per-collection.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
From the Ops BB (Versioneer/ESA) STAC SLO demo:
- The missing signal is operation-level latency (search vs item-listing vs collection-listing); URL-as-label is the cardinality trap they backed off; native route-template metrics solve it safely.
- POST /search is the latency-critical path the SLO alert targets.
- Agreed minimum bar: enable the framework's out-of-the-box metrics.
- Alerting is SLO/user-impact-centric, not CPU/memory.
- Keep enrichment correlates gateway/app/DB latency; native app-layer latency makes the "is it the app?" branch accurate.
- Per-collection (e.g. only VHR slow) is a real but deeper, deferred need.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop revision-style framing so the document reads as an original proposal (no implied earlier published version).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Clarify GET vs POST burn-rate records; POST /search is the critical path.
- Note /metrics is cluster-internal only (port 8080), not via public APISIX ingress (matters for the auth proxy).
- Make burn-rate rebasing the Ops BB's decision; eoAPI only exposes signals.
- Soften the logs row (collected as-is; structured logging is later).
- Add deployed-name <-> upstream-package mapping and cross-repo dependency order; flag the demo Drive link as access-restricted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Adds a focused design doc, docs/design/observability-operations-bb.md, for the first increment of monitoring & logging integration with the EOEPCA Operations BB (#202). Registered under Design in the nav.
It is deliberately narrow and eoAPI-first: the work lands in the apps and the eoapi-k8s chart we maintain, as a generic, opt-in capability, and EOEPCA is reached with a few Helm-values changes.
What it proposes
Aligned with the Operations BB docs (STAC Scenario, ServiceMonitors, Alerting & SLOs) and the EOEPCA+ demo (8 May 2026):
- […] (/metrics is 404 on eoapi-stac and eoapi-stac-auth-proxy; no data-access ServiceMonitor).
- […] /metrics (operation = search / item-listing / collection-listing, the signal APISIX can't give safely) on eoapi-stac + eoapi-stac-auth-proxy, an opt-in ServiceMonitor, and dashboard/alert panels, feeding the existing 99% < 500ms SLO.
- […] /metrics; logs unchanged (Alloy already collects them).
Acceptance criterion (#202)
The doc itself is the clarification of the eoAPI ↔ Operations BB monitoring/logging dependency, which is the single AC on #202. Implementation (app /metrics + chart ServiceMonitor + EOEPCA values) is the follow-on, with the critical path and cross-repo dependencies documented.
Asking reviewers
Feedback welcome on the contract (eoAPI vs platform split), the metric/label shape, and the STAC-slice scope. Flagging @achtsnits / @pantierra.
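For reviewers weighing the metric/label shape, here is a minimal Python sketch of how a bounded operation label could be derived from route templates rather than raw URLs. The route table, the "other" fallback, and the function name operation_label are illustrative assumptions, not the apps' actual implementation; the paths assume the usual STAC API layout (/search, /collections, /collections/{id}/items).

```python
import re

# Hypothetical route templates mapped to operation labels. In a real app the
# templates would come from the framework's router, not from this table.
ROUTE_OPERATIONS = [
    ("POST", re.compile(r"/search"), "search"),
    ("GET", re.compile(r"/search"), "search"),
    ("GET", re.compile(r"/collections"), "collection-listing"),
    ("GET", re.compile(r"/collections/[^/]+/items"), "item-listing"),
]


def operation_label(method: str, path: str) -> str:
    """Return a label from a small fixed set, never the raw URL."""
    for route_method, pattern, operation in ROUTE_OPERATIONS:
        if method.upper() == route_method and pattern.fullmatch(path):
            return operation
    # Anything unrecognised collapses into one bucket (an assumption here),
    # so unknown paths cannot create new label values.
    return "other"


# Different collection IDs map to the same label value.
requests = [
    ("GET", "/collections/vhr/items"),
    ("GET", "/collections/sentinel-2/items"),
    ("POST", "/search"),
    ("GET", "/collections"),
    ("GET", "/collections/vhr/items/abc"),
]
labels = {operation_label(method, path) for method, path in requests}
```

Whatever collection or item ID appears in the path, the label set stays at four values in this sketch, which is what keeps the series count bounded compared with URL-as-label.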
Relates to #202 and developmentseed/eoAPI#193.
🤖 Generated with Claude Code