
Deployment: Production CD for guardrails #100

Merged
nishika26 merged 5 commits into main from enhancement/cd_production
May 4, 2026
Conversation

@nishika26
Contributor

@nishika26 nishika26 commented Apr 30, 2026

Summary

Target issue is #103

Notes

  • Chores

    • Added production deployment automation supporting version-tagged releases and manual dispatch; includes orchestration of services, remote instance deployment, repository synchronization, Docker image management, and automated database migrations to streamline production releases.
  • Bug Fixes

    • Validator configuration processing updated to exclude the name field from individual validator definitions during API schema normalization.

Summary by CodeRabbit

  • Deployment

    • Added automated production deployment pipeline that triggers on version releases, handling Docker image builds and database migrations.
  • Configuration

    • Updated validator configuration processing to refine how validators are normalized in API configurations.

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough


Adds a GitHub Actions CD workflow that deploys tagged releases to an EC2 instance via SSM (it assumes an AWS OIDC role, then runs a shell script that fetches code, builds Docker Compose images, runs Alembic migrations, starts services, and prunes images), plus a polling loop that waits for SSM command completion. Updates validator normalization to drop the name field from validator objects.

Changes

Production CD Workflow (.github/workflows/cd-production.yml)

  • Workflow Definition: new workflow triggered on tag pushes matching v*; sets concurrency and run permissions.
  • Permissions / AWS Setup: workflow-level permissions include id-token: write and contents: read; uses aws-actions/configure-aws-credentials@v6 to assume a role via OIDC with EC2_DEPLOY_ROLE_ARN and AWS_REGION.
  • SSM Command Construction: sends AWS-RunShellScript to EC2_INSTANCE_ID, targeting BUILD_DIRECTORY and ${{ github.sha }}; captures the CommandId to step outputs.
  • Remote Script Actions: the remote script runs set -eux, chowns the build directory, runs git fetch --tags, checks out the exact SHA, builds the Docker Compose images, runs uv run alembic upgrade head inside backend, runs docker compose up -d --remove-orphans, and prunes images.
  • Polling / Completion: polls aws ssm get-command-invocation for Success, up to 20 attempts with a 15s sleep; on success prints the invocation JSON and exits 0; on Failed/Cancelled/TimedOut or on timeout prints the invocation JSON and exits 1.
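
Taken together, these pieces describe a workflow of roughly the following shape. This is a minimal sketch, not the exact file from the PR: the step layout is inferred from the summaries above, sourcing BUILD_DIRECTORY from a secret is an assumption, and the remote script is condensed (the real one also chowns the build directory and runs under ec2-user).

name: cd-production

on:
  push:
    tags:
      - "v*"            # version-tagged releases only

concurrency:
  group: cd-production  # one production deploy at a time

permissions:
  id-token: write       # required for OIDC role assumption
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v6
        with:
          role-to-assume: ${{ secrets.EC2_DEPLOY_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Trigger deploy on EC2 via SSM
        id: ssm
        run: |
          # Send the deploy script to the instance and capture the CommandId
          CMD_ID=$(aws ssm send-command \
            --instance-ids "${{ secrets.EC2_INSTANCE_ID }}" \
            --document-name "AWS-RunShellScript" \
            --parameters commands="set -eux; cd ${{ secrets.BUILD_DIRECTORY }} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint '' backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f" \
            --query "Command.CommandId" --output text)
          echo "command_id=$CMD_ID" >> "$GITHUB_OUTPUT"

      - name: Wait for SSM command to finish
        run: |
          # Poll up to 20 times, 15s apart, mirroring the loop described above
          for i in {1..20}; do
            INVOCATION=$(aws ssm get-command-invocation \
              --command-id "${{ steps.ssm.outputs.command_id }}" \
              --instance-id "${{ secrets.EC2_INSTANCE_ID }}")
            STATUS=$(echo "$INVOCATION" | jq -r .Status)
            if [ "$STATUS" = "Success" ]; then echo "$INVOCATION"; exit 0; fi
            case "$STATUS" in
              Failed|Cancelled|TimedOut) echo "$INVOCATION"; exit 1 ;;
            esac
            sleep 15
          done
          echo "$INVOCATION"; exit 1   # polling budget exhausted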

Validator Normalization (backend/app/schemas/guardrail_config.py)

  • Normalization Logic: normalize_validators_from_config_api appends "name" to drop_fields, removing the validator-level name from the normalized output.
  • Behavior Preservation: other normalization behavior is unchanged, including the mapping/handling of on_fail_action.

Sequence Diagram

sequenceDiagram
    participant GH as "GitHub Actions"
    participant AWS as "AWS (STS / SSM)"
    participant EC2 as "EC2 Instance (SSM Agent)"
    participant Backend as "Backend Service (Docker Compose)"
    participant DB as "Database"

    GH->>AWS: Assume role via OIDC
    AWS-->>GH: Temporary credentials
    GH->>AWS: SendCommand (AWS-RunShellScript) -> target EC2_INSTANCE_ID, script (checkout, build, alembic, up, prune)
    AWS->>EC2: Deliver SSM command
    EC2->>EC2: Execute script (chown, checkout, build images)
    EC2->>Backend: Start containers via docker compose
    Backend->>DB: Run Alembic migrations (uv run alembic upgrade head)
    EC2->>AWS: Command execution result (stdout/stderr, status)
    loop Poll until completion
      GH->>AWS: get-command-invocation(CommandId)
      AWS-->>GH: status (Pending/Success/Failed/...)
    end
    GH-->>GH: On Success -> print output & exit 0
    GH-->>GH: On Failure/Timeout -> print invocation JSON & exit 1

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
A tag hops in and the action wakes,
SSM whispers, the EC2 bakes.
Images bloom and migrations stride,
Validators shed a name with pride.
Deploys curl home — a carrot-sized ride.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Deployment: Production CD for guardrails' is directly related to the main change, a production CD workflow for guardrails, though it is somewhat generic and doesn't specify key details such as the EC2 deployment mechanism or the version-tagged triggers.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required 80.00% threshold.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/schemas/guardrail_config.py (1)

78-88: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate "name" in drop_fields set (lint failure risk).

"name" appears twice in the same set (Line 80 and Line 87). This is redundant and triggers Ruff B033.

Suggested fix
         drop_fields = {
             "id",
             "name",
             "organization_id",
             "project_id",
             "stage",
             "is_enabled",
             "created_at",
             "updated_at",
-            "name",
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/schemas/guardrail_config.py` around lines 78-88: the drop_fields
set contains a duplicate entry for "name", which is redundant and triggers lint
B033; edit the drop_fields set so each field appears only once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/cd-production.yml:
- Around lines 16-17: the workflow is currently bound to a staging-protected
environment and staging instance secrets (e.g., the environment key
"AWS_STAGING_ENV_SECRETS" and the staging instance secret referenced later);
update all staging-scoped references to their production equivalents so the CD
job targets production infrastructure: replace "AWS_STAGING_ENV_SECRETS" with
the production environment secret name, and swap the staging instance secret
names (used around lines 31-33 and 47-49) for their production counterparts so
the release tag deploys to production.
- Around lines 34-40: the deployment currently invokes Alembic directly and
bypasses the project's prestart flow; instead, invoke the documented prestart
profile so backend/scripts/prestart.sh (which runs backend_pre_start.py and
initial_data.py) executes. Replace the direct Alembic run in the remote command
sequence with the compose-based prestart invocation: after building (docker
compose build), run the service using the prestart profile (docker compose run
--rm --profile prestart backend) so the prestart script runs, then bring
services up (docker compose up -d --remove-orphans) and prune images as before.
- Around lines 34-40: the deployment script currently hard-resets the remote
checkout to origin/main (the commands array contains git reset --hard
origin/main), which ignores the release tag; change the remote git commands
inside the AWS SSM send-command parameters so the instance checks out the exact
release tag that triggered the workflow (e.g., fetch tags and git checkout the
provided tag variable) instead of resetting to origin/main; update the commands
string built in the send-command call (the parameters entry that uses
BUILD_DIRECTORY and AWS-RunShellScript) to use the workflow-provided tag
variable (or GITHUB_REF/GITHUB_SHA passed into the runner), and keep the
quoting/escaping intact.

---

Outside diff comments:
In `@backend/app/schemas/guardrail_config.py`:
- Around lines 78-88: the drop_fields set contains a duplicate entry for
"name", which is redundant and triggers lint B033; edit the drop_fields set so
each field appears only once.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b3324244-b54a-4600-a7cf-0b8df45b36b0

📥 Commits

Reviewing files that changed from the base of the PR and between 401e397 and f892fd3.

📒 Files selected for processing (2)
  • .github/workflows/cd-production.yml
  • backend/app/schemas/guardrail_config.py

Comment thread .github/workflows/cd-production.yml Outdated
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: AWS_STAGING_ENV_SECRETS

Suggested change
environment: AWS_STAGING_ENV_SECRETS
environment: AWS_ENV_SECRETS

Comment thread .github/workflows/cd-production.yml Outdated

- name: Wait for SSM command to finish
  env:
    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}

Suggested change
INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}

Comment thread .github/workflows/cd-production.yml Outdated
- name: Trigger deploy on EC2 via SSM
  id: ssm
  env:
    INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}

Suggested change
INSTANCE_ID: ${{ secrets.STAGING_EC2_INSTANCE_ID }}
INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/cd-production.yml:
- Around lines 51-85: the polling loop uses a fixed 20-iteration limit (for i
in {1..20}), which gives only 300s total and can time out during cold builds;
increase the polling budget and make the iteration count and sleep interval
configurable via inputs or env, aligned with the SSM command timeout, so the
CMD_ID/INSTANCE_ID polling waits long enough for long docker compose builds;
adjust the loop controls (for i in {1..20}, sleep 15) and expose a configurable
total timeout or retry count for the aws ssm get-command-invocation checks so
the budget can be raised for production cold builds.
- Around lines 3-7: the workflow currently allows manual runs via
workflow_dispatch, which can bypass the tag-only deploy gate; either add a
conditional check like if: github.ref_type == 'tag' on the job or its critical
steps, or change workflow_dispatch to require an explicit ref input (e.g.,
inputs: ref) and validate that the input is a tag before running the deploy.
- Line 39: update the commands string passed to --parameters commands so the
shell runs "docker compose up --wait --remove-orphans" instead of "docker
compose up -d --remove-orphans" (optionally adding "--wait-timeout <seconds>"
for an upper bound); the --wait form blocks until container healthchecks (e.g.,
the backend healthcheck defined in docker-compose) report healthy before the
script exits.
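
Taken together, these three comments point at a hardened variant along these lines (a sketch only; MAX_POLLS and POLL_INTERVAL are illustrative names, not code from the PR):

jobs:
  deploy:
    if: github.ref_type == 'tag'   # tag-only gate, even for manual dispatch
    runs-on: ubuntu-latest
    steps:
      # ... earlier steps unchanged; inside the remote script,
      # "docker compose up --wait --remove-orphans" replaces the "-d" form
      - name: Wait for SSM command to finish
        env:
          MAX_POLLS: 120        # 120 x 15s gives a 30-minute budget
          POLL_INTERVAL: 15
          CMD_ID: ${{ steps.ssm.outputs.command_id }}
          INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}
        run: |
          for i in $(seq 1 "$MAX_POLLS"); do
            STATUS=$(aws ssm get-command-invocation \
              --command-id "$CMD_ID" --instance-id "$INSTANCE_ID" \
              --query "Status" --output text)
            if [ "$STATUS" = "Success" ]; then exit 0; fi
            case "$STATUS" in Failed|Cancelled|TimedOut) exit 1 ;; esac
            sleep "$POLL_INTERVAL"
          done
          echo "Timed out waiting for SSM command"; exit 1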

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e765c79-6adb-43f9-bee0-77aa5ba9cf2b

📥 Commits

Reviewing files that changed from the base of the PR and between f892fd3 and 9e8c080.

📒 Files selected for processing (1)
  • .github/workflows/cd-production.yml

@nishika26 nishika26 linked an issue May 4, 2026 that may be closed by this pull request
@nishika26 nishika26 self-assigned this May 4, 2026
@nishika26 nishika26 added enhancement New feature or request labels May 4, 2026
Comment thread .github/workflows/cd-production.yml Outdated
push:
  tags:
    - "v*" # Deploy only when tags like v1.0.0, v2.1.0, etc., are created
workflow_dispatch:

@Ayush8923 Ayush8923 May 4, 2026


We don't need this workflow_dispatch trigger; this flow triggers automatically when the release tag is added.
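
If the manual trigger is dropped as suggested, the on: block reduces to a tags-only push trigger, roughly:

on:
  push:
    tags:
      - "v*"   # deploy only on release tags like v1.0.0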

Comment thread .github/workflows/cd-production.yml Outdated
--instance-ids "$INSTANCE_ID" \
--document-name "AWS-RunShellScript" \
--comment "Deploy kaapi-guardrails production" \
--parameters commands="[\"set -eux\",\"sudo chown -R ec2-user:ec2-user ${BUILD_DIRECTORY}\",\"sudo -iu ec2-user bash -lc \\\"cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint \\\\\\\"\\\\\\\" backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f\\\"\"]" \

@Ayush8923 Ayush8923 May 4, 2026


Remove these unwanted escaped backslashes and write a clean command here.
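
One way to avoid the nested escaping is to build the remote script in a shell variable and pass it to send-command once, along these lines (a sketch; it matches the direction of the DEPLOY_CMD one-liner quoted in the later review, and sourcing BUILD_DIRECTORY from a secret is an assumption):

- name: Trigger deploy on EC2 via SSM
  id: ssm
  env:
    INSTANCE_ID: ${{ secrets.EC2_INSTANCE_ID }}
    BUILD_DIRECTORY: ${{ secrets.BUILD_DIRECTORY }}
  run: |
    # Compose the remote script once, with no nested quoting
    DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint '' backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f"
    aws ssm send-command \
      --instance-ids "$INSTANCE_ID" \
      --document-name "AWS-RunShellScript" \
      --comment "Deploy kaapi-guardrails production" \
      --parameters commands="$DEPLOY_CMD" \
      --query "Command.CommandId" --output text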


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
.github/workflows/cd-production.yml (2)

34-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the repo’s prestart path and block until services are actually ready.

At Line 34, the command still runs Alembic directly and then uses detached startup. That bypasses the prestart service flow defined in docker-compose.yml (which runs backend/scripts/prestart.sh) and can mark deployment successful before service health is confirmed.

Suggested update
-          DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose run --rm --entrypoint '' backend uv run alembic upgrade head && docker compose up -d --remove-orphans && docker image prune -f"
+          DEPLOY_CMD="cd ${BUILD_DIRECTORY} && git fetch --tags origin && git checkout --force ${{ github.sha }} && docker compose build && docker compose --profile prestart up prestart && docker compose up --wait --wait-timeout 300 --remove-orphans && docker image prune -f"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cd-production.yml at line 34: replace the one-liner
DEPLOY_CMD that runs Alembic directly and starts services detached with a flow
that invokes the repo's prestart path and blocks until services are healthy:
keep the fetch-and-checkout of the commit, use docker compose to run the
prestart flow (the service that executes backend/scripts/prestart.sh as defined
in docker-compose.yml) instead of running Alembic directly, and bring the
compose stack up in a foreground/blocking way (or add an explicit health-check
wait loop) so the job only finishes when services report healthy.

52-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Increase the polling window to match real production deploy duration.

At Line 52, 20 * 15s = 300s is a tight cap for cold builds/migrations and can fail the workflow while SSM is still progressing. Make timeout configurable and align it with your expected deploy envelope.

Suggested update
-          for i in {1..20}; do
+          MAX_WAIT_SECONDS=1800
+          POLL_INTERVAL_SECONDS=15
+          MAX_POLLS=$((MAX_WAIT_SECONDS / POLL_INTERVAL_SECONDS))
+          for i in $(seq 1 "$MAX_POLLS"); do
@@
-            sleep 15
+            sleep "$POLL_INTERVAL_SECONDS"
           done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/cd-production.yml around lines 52-86: the SSM polling
loop uses a hardcoded 20 iterations with sleep 15 ("for i in {1..20}" and
"sleep 15"), which caps the wait at ~300s and can expire during cold
builds/migrations; make the timeout configurable by replacing the fixed count
with a parameter (e.g., POLL_RETRIES or POLL_TIMEOUT) and a configurable
interval (e.g., POLL_INTERVAL) read from environment or workflow inputs, then
use those variables instead of the literal "1..20" and "sleep 15" while
preserving the CMD_ID/INSTANCE_ID status checks and the same success/failure
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a8069f6c-57de-40b4-bda1-6c95b7a2676d

📥 Commits

Reviewing files that changed from the base of the PR and between 9e8c080 and d55f8f7.

📒 Files selected for processing (1)
  • .github/workflows/cd-production.yml

@nishika26 nishika26 merged commit 52de1e2 into main May 4, 2026
2 checks passed
@nishika26 nishika26 deleted the enhancement/cd_production branch May 4, 2026 12:54

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI/CD: Add Production CD pipeline

2 participants