Merge dev to main: monitoring, Alloy, pool fix #196

Merged
crtahlin merged 8 commits into main from dev on Apr 15, 2026

@crtahlin

Summary

Promotes all monitoring work from dev to production.

What's included

  • Prometheus /metrics endpoint with auto-instrumented HTTP metrics + custom business metrics
  • Grafana Alloy sidecar container pushing metrics to Grafana Cloud every 15s
  • 23-panel Grafana dashboard at datafund.grafana.net/d/gateway-overview
  • Retry fix for 429 responses during stamp pool replenishment (#176, "Fix stamp pool 429 rate limiting during replenishment")
  • 19 new metrics tests, zero regressions
  • Updated CLAUDE.md, README.md, .env.example, deploy.yml

What happens on merge

  1. The deploy workflow builds the swarm_connect:main image, which now includes the /metrics endpoint
  2. It writes the Grafana Cloud credentials from GitHub secrets into .env
  3. It starts 3 containers: production gateway (8899), dev gateway (8900), and Alloy
  4. Alloy scrapes the production gateway and pushes its metrics with the environment=main label
  5. "main" appears in the environment dropdown of the Grafana Cloud dashboard

Issues addressed

crtahlin added 8 commits April 3, 2026 14:35
The Bee node returns 429 when stamp purchases are made back-to-back.
This adds a 15-second delay between consecutive purchases in the
replenishment loop and retry logic (3 attempts with backoff) in
_purchase_stamp() for transient 429 errors.

Fixes #175
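
For reviewers who want the shape of the fix without opening the diff, here is a minimal sketch of the pattern the commit message describes: a fixed gap between consecutive purchases, plus a bounded retry with backoff around each purchase. It is not the code from this PR. `RateLimited`, `purchase_with_retry`, `replenish`, and the 5-second base delay with doubling are assumptions made for illustration; only the 15-second gap and the 3 attempts come from the message above.

```python
import time
from typing import Callable, List


class RateLimited(Exception):
    """Hypothetical stand-in for a 429 response from the Bee node."""


def purchase_with_retry(
    purchase: Callable[[], str],
    attempts: int = 3,          # from the commit message
    base_delay: float = 5.0,    # assumed; the message only says "with backoff"
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call purchase(), retrying on RateLimited with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return purchase()
        except RateLimited:
            if attempt == attempts:
                raise  # out of attempts: surface the 429 to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def replenish(
    count: int,
    purchase: Callable[[], str],
    gap: float = 15.0,          # delay between purchases, from the commit message
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Buy `count` stamps, pausing `gap` seconds between consecutive purchases."""
    stamp_ids = []
    for i in range(count):
        if i > 0:
            sleep(gap)  # space purchases out so the node is not hit back-to-back
        stamp_ids.append(purchase_with_retry(purchase, sleep=sleep))
    return stamp_ids
```

Passing `sleep` as a parameter keeps the delays out of unit tests: a test can hand in `list.append` and assert on the recorded waits instead of actually sleeping.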
Fix stamp pool 429 rate limiting during replenishment
Add Prometheus metrics endpoint and monitoring foundation
…191)

Add Alloy as a Docker service that scrapes /metrics from both gateway
containers and pushes to Grafana Cloud Prometheus. Credentials stored in
GitHub secrets, injected via .env file at deploy time. Scrapes staging
(provenance_gateway_dev:8000) and production (provenance_gateway:8000)
with environment labels.
…or to dashboard (#192)

- Alloy: staging → development, production → main (matches branch names)
- deploy.yml: GATEWAY_ENVIRONMENT=development for dev, =main for main
- Dashboard: add environment dropdown, filter all panels by $environment,
  add Bee API Errors panel
…cs (#193)

- Fix stamp pool metrics: use current_levels/reserve_config from get_status()
  instead of non-existent reserves key
- Add descriptions to all 17 dashboard panels explaining what each shows
- Add Bee API Errors panel to dashboard
- Update CLAUDE.md with production monitoring stack architecture and setup
- Update README.md with full monitoring section including Grafana Cloud setup
* Fix pool metrics: access dataclass attributes instead of dict keys

* Fix pool metrics dataclass access, add stamp provisioning and debug panels

- Fix pool metrics poller: use dataclass attributes instead of dict.get()
- Add 12 new dashboard panels: stamp provisioning breakdown (pool vs direct,
  by size), data volume, HTTP status codes, latency by endpoint, notary
  signing, gateway version info, deploy history
- Dashboard now has 23 panels across 7 rows
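
The dataclass fix above is easy to picture with a small sketch. It assumes `get_status()` returns a dataclass carrying `current_levels` and `reserve_config` (names taken from the earlier commit message); the `PoolStatus` class, its field types, the `pool_gauges` helper, and the gauge names are hypothetical and are not the code in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PoolStatus:
    """Hypothetical shape of the object returned by get_status()."""
    current_levels: Dict[int, int] = field(default_factory=dict)  # size -> stamps available
    reserve_config: Dict[int, int] = field(default_factory=dict)  # size -> target reserve


def pool_gauges(status: PoolStatus) -> Dict[str, int]:
    """Flatten a pool status into gauge values keyed by (assumed) metric name."""
    gauges: Dict[str, int] = {}
    for size, target in status.reserve_config.items():
        # Attribute access works on the dataclass; status.get("reserves")
        # would raise AttributeError because a dataclass is not a dict.
        gauges[f"pool_available_size_{size}"] = status.current_levels.get(size, 0)
        gauges[f"pool_target_size_{size}"] = target
    return gauges
```

A poller that wraps the old `dict.get()` call in a broad `except` would fail silently, which would explain panels showing no data until the access pattern was corrected.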
crtahlin merged commit 6322870 into main on Apr 15, 2026
1 check passed