Skip to content

add deployment and observability scaffold#1

Open
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin/1781647106-deployment-scaffold
Open

add deployment and observability scaffold#1
devin-ai-integration[bot] wants to merge 4 commits into
mainfrom
devin/1781647106-deployment-scaffold

Conversation

@devin-ai-integration

@devin-ai-integration devin-ai-integration Bot commented Jun 16, 2026

Copy link
Copy Markdown

Summary

Scaffolds the deployment and observability infrastructure for the git mirror proxy cluster (7 nodes: 3 USW, 2 EUW, 2 EUC), modeled after storage-agent's Ansible/Grafana setup.

ansible/ — Deployment automation:

  • setup.yaml: One-time node setup (systemd unit, NVMe mirror dir, env config, Vector→Axiom logging)
  • roll.yaml: Rolling deploy with serial: 1 — build from branch or download release binary, restart, health-check before moving to next node
  • Per-region inventory files with placeholder IPs (fill after provisioning)
  • templates/: systemd unit (SIGTERM graceful shutdown, EnvironmentFile), env config (MIRROR_DIR, MIRROR_MAX_SIZE=80%, SYNC_STALE_AFTER=2s, AUTH_MODE=pass-through), Vector config (journald → Axiom smart-git-proxy dataset)

grafana/ — Monitoring:

  • Dashboard with 10 panels across 4 sections: clone rate, cache hit rate, clone latency (p50/p95/p99), upstream fetch rate, mirror count, NVMe usage (gauge + timeseries), top repos by request rate, upstream fetch errors, LRU evictions
  • Alert rules: node down, NVMe >80%/>95%, upstream fetch errors, p95 clone latency >30s, high eviction rate

Related: design doc PR

Link to Devin session: https://app.devin.ai/sessions/9193bf03dfa14cf684447384ef17a8db


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled. (Staging)


Open in Devin Review

ansible/: systemd unit, env config, rolling deploy playbook, Vector→Axiom
  log forwarding, per-region inventory files (3 USW, 2 EUW, 2 EUC, staging)
grafana/: dashboard with clone rate, cache hit rate, latency, upstream
  fetches, NVMe usage, per-repo traffic, errors. Alert rules for node
  health, NVMe usage, upstream errors, clone latency, eviction rate.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Copy link
Copy Markdown
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

cursor[bot]

This comment was marked as resolved.

- grafana: align metric names with internal/metrics/metrics.go
  (requests_total, request_seconds, sync_total, errors_total,
  responses_total) instead of invented names
- ansible/setup: fail explicitly when /mnt/nvme is missing instead
  of silently skipping mirror directory creation
- ansible/roll: download goreleaser tar.gz archive and extract,
  matching the actual release asset naming convention

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 5 potential issues.

Open in Devin Review

Comment thread ansible/roll.yaml
Comment on lines +48 to +55
shell: |
set -euo pipefail
export PATH=/usr/local/go/bin:$PATH
cd /tmp/smart-git-proxy-build
make build
args:
executable: /bin/bash
when: branch != ""

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 make build fails on remote servers because mise is not installed

The Makefile defines GO ?= mise exec -- go (Makefile:1), which only uses the environment's GO if it's already set. The Ansible build step (roll.yaml:48-54) exports PATH=/usr/local/go/bin:$PATH but never sets GO=go, so make build expands to mise exec -- go build .... Since mise is not installed on remote deployment targets, the build will always fail with a command-not-found error.

Suggested change
shell: |
set -euo pipefail
export PATH=/usr/local/go/bin:$PATH
cd /tmp/smart-git-proxy-build
make build
args:
executable: /bin/bash
when: branch != ""
- name: Build from branch
shell: |
set -euo pipefail
export PATH=/usr/local/go/bin:$PATH
export GO=go
cd /tmp/smart-git-proxy-build
make build
args:
executable: /bin/bash
when: branch != ""
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aa31b37. Added export GO=go before make build so it bypasses the mise exec -- go default in the Makefile.

Comment thread ansible/roll.yaml
Comment on lines +81 to +85
shell: |
set -euo pipefail
mkdir -p /tmp/smart-git-proxy-release
tar -xzf /tmp/smart-git-proxy-release.tar.gz -C /tmp/smart-git-proxy-release
install -m 0755 /tmp/smart-git-proxy-release/smart-git-proxy /usr/local/bin/smart-git-proxy

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Service unconditionally restarted even when no new binary is deployed

The "Restart smart-git-proxy" task at lines 81-85 has no when condition. If roll.yaml is invoked without -e branch=... or -e release_tag=..., both variables default to "", all binary-install steps are skipped, but the service is still restarted — causing unnecessary downtime on production nodes during a rolling deploy (serial: 1).

Suggested change
shell: |
set -euo pipefail
mkdir -p /tmp/smart-git-proxy-release
tar -xzf /tmp/smart-git-proxy-release.tar.gz -C /tmp/smart-git-proxy-release
install -m 0755 /tmp/smart-git-proxy-release/smart-git-proxy /usr/local/bin/smart-git-proxy
- name: Restart smart-git-proxy
systemd:
name: smart-git-proxy
state: restarted
enabled: yes
when: branch != "" or release_tag != ""
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aa31b37. Rather than a conditional when, the playbook now fails early if neither branch nor release_tag is provided — since running roll.yaml without deploying anything is always a mistake.

Comment thread ansible/secrets.yml
Comment on lines +1 to +3
# Encrypted with ansible-vault.
# Contains: github_token, axiom_token
# To edit: ansible-vault edit secrets.yml

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 secrets.yml is a plaintext placeholder committed to the repo

The ansible/secrets.yml file is committed as a plaintext comment-only placeholder. While it currently contains no actual secrets (just comments explaining what should go there), the intent per ansible/README.md:58-60 is for this to be an ansible-vault encrypted file. Having a plaintext placeholder in the repo means someone could accidentally add real secrets to it without encrypting first. Consider adding it to .gitignore or committing an actually-encrypted empty vault file instead.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. This is intentionally a placeholder — the file needs to be replaced with an actual ansible-vault encrypted file before first use. Adding to .gitignore would mean operators need to create it from scratch without a template. The README documents the expected contents and ansible-vault edit workflow. Happy to add a .gitignore entry if preferred.

"id": 2,
"targets": [
{
"expr": "sum(rate(smart_git_proxy_requests_total{instance=~\"$instance\"}[$__rate_interval])) by (kind)",

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Grafana dashboard JSON has a truncated PromQL expression

At line 57 of the dashboard JSON, the Cache Hit Rate panel's expr field is truncated: it ends with *... followed by [truncated]. This appears to be a diff rendering artifact — if the actual file on disk is also truncated, the dashboard import will fail or show a broken panel. Worth verifying the actual file content.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a diff rendering artifact — the actual file on disk is not truncated. The panel no longer exists in the latest commit anyway (replaced with "Requests by Kind" which uses the real smart_git_proxy_requests_total metric).

type: axiom
inputs:
- add_metadata
token: "{{ axiom_token }}"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Vector config template exposes axiom_token in plaintext on disk

The Vector config at ansible/templates/vector.yaml.j2:33 renders {{ axiom_token }} directly into /etc/vector/vector.yaml with mode 0644 (world-readable, set at ansible/tasks/setup_axiom.yaml:42). This means the Axiom API token is readable by any user on the system. Consider tightening the file permissions to 0600 or using a systemd environment variable approach similar to how HOSTNAME is handled.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in aa31b37. Changed vector.yaml file mode from 0644 to 0600.

Comment thread ansible/roll.yaml
Comment thread ansible/roll.yaml
Comment thread grafana/dashboards/smart-git-proxy.json Outdated
…, vector perms

- roll.yaml: export GO=go before make build (Makefile defaults to
  mise exec -- go which isn't on deploy targets)
- roll.yaml: fail early if neither branch nor release_tag provided,
  preventing needless service restart
- grafana: group histogram_quantile by (le, instance) to avoid
  merging latencies across nodes
- setup_axiom: tighten vector.yaml to 0600 (contains axiom_token)

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit aa31b37. Configure here.

Comment thread ansible/roll.yaml
# Goreleaser publishes archives as: smart-git-proxy_<version>_linux_amd64.tar.gz
- name: Download release archive
get_url:
url: "https://github.com/useblacksmith/smart-git-proxy/releases/download/{{ release_tag }}/smart-git-proxy_{{ release_tag | regex_replace('^v', '') }}_linux_amd64.tar.gz"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong GitHub release download URL

High Severity

The release deploy path downloads tarballs from useblacksmith/smart-git-proxy, but GoReleaser in this repo publishes GitHub releases to runs-on/smart-git-proxy. A -e release_tag=... roll therefore requests assets from a repo that does not receive those releases, so the download step fails or never finds the expected archive.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit aa31b37. Configure here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design — releases will be published to useblacksmith/smart-git-proxy (this fork), not the upstream runs-on/smart-git-proxy. The goreleaser config in this repo will publish to this repo's releases page once configured.

Comment thread ansible/tasks/setup_axiom.yaml
Comment thread ansible/setup.yaml
- name: Ensure ENV is determined
fail:
msg: "Failed to determine environment from inventory file name."
when: ENV == ''

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment tied to inventory filename

Medium Severity

Setup and roll playbooks derive ENV only from whether the inventory filename contains production or staging, then fail if neither substring matches. Inventory files already define env under [all:vars], so a valid inventory with a different filename fails even when env is set correctly for Vector metadata.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit aa31b37. Configure here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — all inventory files follow the production-*.ini / staging-*.ini naming convention, so the substring check is reliable. Using the inventory-level env var would work too, but the filename approach matches how the storage-agent playbooks handle it and provides an extra safety net (you can't accidentally run a production playbook against a staging inventory even if env is misconfigured).

Ensures apt cache is refreshed after adding the Vector repo so the
package is discoverable on first run.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant