Deployment: Production CD for guardrails #100
Conversation
📝 Walkthrough

Adds a GitHub Actions CD workflow that deploys tagged releases to an EC2 instance via SSM: the job assumes an AWS role via OIDC, sends a shell script that fetches code, builds the Docker Compose images, runs Alembic migrations, starts services, and prunes images, then polls until the SSM command completes. Also updates validator normalization to drop metadata fields (id, name, organization_id, project_id, stage, is_enabled, created_at, updated_at) from guardrail configs.

Changes
- Production CD Workflow
- Validator Normalization
Sequence Diagram

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant AWS as "AWS (STS / SSM)"
    participant EC2 as "EC2 Instance (SSM Agent)"
    participant Backend as "Backend Service (Docker Compose)"
    participant DB as "Database"
    GH->>AWS: Assume role via OIDC
    AWS-->>GH: Temporary credentials
    GH->>AWS: SendCommand (AWS-RunShellScript) -> target EC2_INSTANCE_ID, script (checkout, build, alembic, up, prune)
    AWS->>EC2: Deliver SSM command
    EC2->>EC2: Execute script (chown, checkout, build images)
    EC2->>Backend: Start containers via docker compose
    Backend->>DB: Run Alembic migrations (uv run alembic upgrade head)
    EC2->>AWS: Command execution result (stdout/stderr, status)
    loop Poll until completion
        GH->>AWS: get-command-invocation(CommandId)
        AWS-->>GH: status (Pending/Success/Failed/...)
    end
    GH-->>GH: On Success -> print output & exit 0
    GH-->>GH: On Failure/Timeout -> print invocation JSON & exit 1
```
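For orientation, here is a minimal sketch of the diagram's first half (OIDC role assumption, then SendCommand). The role ARN, region, and secret name are placeholders, not values from this PR:

```yaml
# Sketch only — role ARN, region, and the INSTANCE_ID secret name are placeholders.
permissions:
  id-token: write   # required for GitHub's OIDC federation with AWS
  contents: read

steps:
  - name: Assume AWS role via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/deploy-role  # placeholder
      aws-region: us-east-1                                       # placeholder

  - name: Trigger deploy on EC2 via SSM
    id: ssm
    env:
      INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}  # assumed secret name
    run: |
      # SendCommand returns immediately; capture the CommandId for the polling step.
      CMD_ID=$(aws ssm send-command \
        --instance-ids "$INSTANCE_ID" \
        --document-name "AWS-RunShellScript" \
        --parameters commands='["set -eux","echo deploy steps run here"]' \
        --query "Command.CommandId" --output text)
      echo "cmd_id=$CMD_ID" >> "$GITHUB_OUTPUT"
```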
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/schemas/guardrail_config.py (1)
78-88: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate "name" in the drop_fields set (lint failure risk).

"name" appears twice in the same set (Line 80 and Line 87). This is redundant and triggers Ruff B033.

Suggested fix

```diff
 drop_fields = {
     "id",
     "name",
     "organization_id",
     "project_id",
     "stage",
     "is_enabled",
     "created_at",
     "updated_at",
-    "name",
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/schemas/guardrail_config.py` around lines 78 - 88, The drop_fields set in guardrail_config.py contains a duplicate entry for "name" which is redundant and triggers lint B033; edit the drop_fields set (the variable named drop_fields) to remove the duplicate "name" so each field appears only once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cd-production.yml:
- Around line 16-17: The workflow is currently bound to staging-protected
environment and instance secrets (e.g., the environment key
"AWS_STAGING_ENV_SECRETS" and the staging instance secret referenced later);
update all staging-scoped references to the production equivalents so the CD job
targets production infrastructure — replace "AWS_STAGING_ENV_SECRETS" with the
production environment secret name and swap any staging instance secret names
(the ones used around lines 31-33 and 47-49) for their production instance
secret names so the release tag deploys to production.
- Around line 34-40: The deployment currently invokes Alembic directly and
bypasses the project's prestart flow; instead invoke the documented prestart
profile so backend/scripts/prestart.sh (which runs backend_pre_start.py and
initial_data.py) executes. Replace the direct alembic run in the remote command
sequence with the compose-based prestart invocation: after building (docker
compose build) run the service using the prestart profile (docker compose run
--rm --profile prestart backend) so the prestart script runs, then bring
services up (docker compose up -d --remove-orphans) and prune images as before;
update the command sequence around the docker compose build / run / up steps
accordingly.
- Around line 34-40: The deployment script currently hard-resets the remote
checkout to origin/main (the commands array contains git reset --hard
origin/main) which ignores the release tag; change the remote git commands
inside the AWS SSM send-command parameters so the instance checks out the exact
release tag that triggered the workflow (e.g., fetch tags and git checkout the
provided release/tag variable) instead of resetting to origin/main; update the
invocation that builds the commands string (the parameters entry in the
send-command call that uses BUILD_DIRECTORY and AWS-RunShellScript) to use the
workflow-provided tag variable (or GITHUB_REF/GITHUB_SHA passed into the runner)
and ensure quoting/escaping is preserved (see the consolidated sketch after this list).
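Taken together, a hedged sketch of a corrected job honoring all three comments — the AWS_ENV_SECRETS / EC2_INSTANCE_ID names, the BUILD_DIRECTORY path, and the prestart profile come from the suggestions above and are not verified against the repo:

```yaml
# Sketch consolidating the three inline comments; names and paths are assumptions.
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: AWS_ENV_SECRETS                       # production-scoped (assumed name)
    steps:
      - name: Trigger deploy on EC2 via SSM
        env:
          INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}  # production instance (assumed name)
          BUILD_DIRECTORY: /opt/kaapi-guardrails       # placeholder path
        run: |
          # Check out the release tag that triggered the run (GITHUB_REF_NAME is the
          # tag name on tag pushes) and run the documented prestart flow instead of
          # invoking Alembic directly.
          DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${GITHUB_REF_NAME} && docker compose build && docker compose --profile prestart up prestart && docker compose up -d --remove-orphans && docker image prune -f"
          aws ssm send-command \
            --instance-ids "$INSTANCE_ID" \
            --document-name "AWS-RunShellScript" \
            --parameters commands="[\"$DEPLOY_CMD\"]"
```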
---
Outside diff comments:
In `@backend/app/schemas/guardrail_config.py`:
- Around line 78-88: The drop_fields set in guardrail_config.py contains a
duplicate entry for "name" which is redundant and triggers lint B033; edit the
drop_fields set (the variable named drop_fields) to remove the duplicate "name"
so each field appears only once.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b3324244-b54a-4600-a7cf-0b8df45b36b0
📒 Files selected for processing (2)
- .github/workflows/cd-production.yml
- backend/app/schemas/guardrail_config.py
```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: AWS_STAGING_ENV_SECRETS
```

Suggested change:

```diff
-    environment: AWS_STAGING_ENV_SECRETS
+    environment: AWS_ENV_SECRETS
```
```yaml
- name: Wait for SSM command to finish
  env:
    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
```

Suggested change:

```diff
-    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
+    INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}
```
```yaml
- name: Trigger deploy on EC2 via SSM
  id: ssm
  env:
    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
```

Suggested change:

```diff
-    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
+    INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}
```
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/cd-production.yml:
- Around line 51-85: The polling loop uses a fixed 20-iteration limit (for i in
{1..20}) which gives only 300s total and can time out during cold builds; update
the workflow to increase the polling budget (e.g., raise the iteration count or
make the sleep/iterations configurable via inputs/env) and align it with the SSM
command timeout so CMD_ID/INSTANCE_ID polling waits long enough for long docker
compose builds; adjust the loop controlling symbols (for i in {1..20}, sleep 15)
and expose a configurable variable for total timeout or number of retries used
when checking aws ssm get-command-invocation so the workflow can be bumped for
production cold builds.
- Around line 3-7: The workflow currently allows manual runs via
workflow_dispatch which can bypass the tag-only deploy gate; update the
deployment job/steps that perform the production deploy (the steps triggered by
on: workflow_dispatch) to only proceed when the run target is a tag by adding a
conditional check like if: github.ref_type == 'tag' on the job or critical
steps, or change workflow_dispatch to require an explicit ref input (e.g.,
inputs: ref) and validate that input is a tag before running the deploy; look
for the workflow_dispatch block and the production deploy job/steps to apply the
if check or input validation.
- Line 39: Update the commands array used in the CD workflow where the shell
string currently runs "docker compose up -d --remove-orphans" so it uses "docker
compose up --wait --remove-orphans" (you may also add "--wait-timeout <seconds>"
if you want an upper bound). Change the literal in the parameter passed to
--parameters commands to replace the "-d" form with the "--wait" form so the
step will block until container healthchecks (e.g., the backend healthcheck
defined in docker-compose) report healthy before the script exits (see the sketch after this list).
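A hedged sketch combining the three suggestions — the 30-minute budget and variable names are illustrative, and the deploy command itself would swap `docker compose up -d` for `docker compose up --wait --remove-orphans` so the step blocks on healthchecks:

```yaml
# Sketch only — tag-gated job with a configurable SSM polling budget.
jobs:
  deploy:
    if: github.ref_type == 'tag'   # refuse manual runs that don't target a tag
    runs-on: ubuntu-latest
    env:
      POLL_INTERVAL: 15            # seconds between status checks (illustrative)
      MAX_POLLS: 120               # 120 * 15s = 30 min total budget (illustrative)
    steps:
      - name: Wait for SSM command to finish
        run: |
          # CMD_ID and INSTANCE_ID are expected from earlier steps/env.
          for i in $(seq 1 "$MAX_POLLS"); do
            STATUS=$(aws ssm get-command-invocation \
              --command-id "$CMD_ID" --instance-id "$INSTANCE_ID" \
              --query "Status" --output text)
            case "$STATUS" in
              Success) exit 0 ;;
              Failed|Cancelled|TimedOut) echo "SSM command $STATUS" >&2; exit 1 ;;
            esac
            sleep "$POLL_INTERVAL"
          done
          echo "Timed out waiting for SSM command" >&2
          exit 1
```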
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6e765c79-6adb-43f9-bee0-77aa5ba9cf2b
📒 Files selected for processing (1)
.github/workflows/cd-production.yml
```yaml
push:
  tags:
    - "v*" # Deploy only when tags like v1.0.0, v2.1.0, etc., are created
workflow_dispatch:
```

We don't need this workflow_dispatch trigger; the flow is triggered automatically when a release tag is added.
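With workflow_dispatch dropped as suggested, the trigger block would reduce to (sketch):

```yaml
on:
  push:
    tags:
      - "v*"   # deploy only on release tags like v1.0.0
```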
```yaml
--instance-ids "$INSTANCE_ID" \
--document-name "AWS-RunShellScript" \
--comment "Deploy kaapi-guardrails production" \
--parameters commands="[\"set -eux\",\"sudo chown -R ec2-user:ec2-user ${BUILD_DIRECTORY}\",\"sudo -iu ec2-user bash -lc \\\"cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint \\\\\\\"\\\\\\\" backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f\\\"\"]" \
```

Remove these unwanted backslash escapes and write the command cleanly here.
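One hedged way to avoid the nested escaping: build the remote script in a shell variable and let jq emit the JSON parameters. DEPLOY_CMD and this jq usage are illustrative, not the PR's final form:

```yaml
- name: Trigger deploy on EC2 via SSM
  run: |
    # Build the remote script once; jq produces the JSON array, so no
    # hand-written backslash escaping is needed (jq ships on ubuntu-latest).
    DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose up -d --remove-orphans && docker image prune -f"
    PARAMS=$(jq -n --arg cmd "$DEPLOY_CMD" '{commands: ["set -eux", $cmd]}')
    aws ssm send-command \
      --instance-ids "$INSTANCE_ID" \
      --document-name "AWS-RunShellScript" \
      --comment "Deploy kaapi-guardrails production" \
      --parameters "$PARAMS"
```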
♻️ Duplicate comments (2)
.github/workflows/cd-production.yml (2)
34-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the repo's prestart path and block until services are actually ready.

At Line 34, the command still runs Alembic directly and then uses detached startup. That bypasses the prestart service flow defined in docker-compose.yml (which runs backend/scripts/prestart.sh) and can mark deployment successful before service health is confirmed.

Suggested update

```diff
-          DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint '' backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f"
+          DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose --profile prestart up prestart && docker compose up --wait --wait-timeout 300 --remove-orphans && docker image prune -f"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-production.yml at line 34, Replace the one-liner DEPLOY_CMD that runs Alembic directly and starts services detached with a flow that invokes the repo's prestart path and blocks until services are healthy: update the DEPLOY_CMD to fetch and checkout the commit as before, then use docker compose to run the prestart flow (the service that executes backend/scripts/prestart.sh as defined in docker-compose.yml) instead of running Alembic directly, and bring the compose stack up in a foreground/blocking way (or include an explicit health-check wait loop) so the job only finishes when services report healthy; reference DEPLOY_CMD, backend/scripts/prestart.sh and the docker-compose prestart service when making the change.
52-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Increase the polling window to match real production deploy duration.

At Line 52, 20 * 15s = 300s is a tight cap for cold builds/migrations and can fail the workflow while SSM is still progressing. Make the timeout configurable and align it with your expected deploy envelope.

Suggested update

```diff
-          for i in {1..20}; do
+          MAX_WAIT_SECONDS=1800
+          POLL_INTERVAL_SECONDS=15
+          MAX_POLLS=$((MAX_WAIT_SECONDS / POLL_INTERVAL_SECONDS))
+          for i in $(seq 1 "$MAX_POLLS"); do
@@
-            sleep 15
+            sleep "$POLL_INTERVAL_SECONDS"
           done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cd-production.yml around lines 52 - 86, The current SSM polling loop uses a hardcoded 20 iterations with sleep 15 ("for i in {1..20}" and "sleep 15"), which caps the wait to ~300s and can expire during cold builds/migrations; make the timeout configurable by replacing the fixed loop/count with a parameter (e.g. POLL_RETRIES or POLL_TIMEOUT) and a configurable interval (e.g. POLL_INTERVAL) that are read from environment or workflow inputs, then use those variables instead of the literal "1..20" and "sleep 15" while still checking CMD_ID/INSTANCE_ID statuses and returning the same success/failure behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/cd-production.yml:
- Line 34: Replace the one-liner DEPLOY_CMD that runs Alembic directly and
starts services detached with a flow that invokes the repo's prestart path and
blocks until services are healthy: update the DEPLOY_CMD to fetch and checkout
the commit as before, then use docker compose to run the prestart flow (the
service that executes backend/scripts/prestart.sh as defined in
docker-compose.yml) instead of running Alembic directly, and bring the compose
stack up in a foreground/blocking way (or include an explicit health-check wait
loop) so the job only finishes when services report healthy; reference
DEPLOY_CMD, backend/scripts/prestart.sh and the docker-compose prestart service
when making the change.
- Around line 52-86: The current SSM polling loop uses a hardcoded 20 iterations
with sleep 15 ("for i in {1..20}" and "sleep 15"), which caps the wait to ~300s
and can expire during cold builds/migrations; make the timeout configurable by
replacing the fixed loop/count with a parameter (e.g. POLL_RETRIES or
POLL_TIMEOUT) and a configurable interval (e.g. POLL_INTERVAL) that are read
from environment or workflow inputs, then use those variables instead of the
literal "1..20" and "sleep 15" while still checking CMD_ID/INSTANCE_ID statuses
and returning the same success/failure behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a8069f6c-57de-40b4-bda1-6c95b7a2676d
📒 Files selected for processing (1)
.github/workflows/cd-production.yml
Summary
Target issue is #103
Notes
Chores
Bug Fixes
Summary by CodeRabbit
Deployment
Configuration