feat: add OpenZiti bootstrap stack by casey-brooks · Pull Request #117 · agynio/bootstrap

casey-brooks · 2026-03-18T00:29:47Z

Summary

add cert-manager, trust-manager, and OpenZiti controller Helm releases
add Istio TLS passthrough gateway and virtual services for Ziti endpoints
create the ziti stack for router, identities, services, and policies and update apply flow

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Issue

Deploy OpenZiti Controller, Edge Router, and bootstrap identities/policies #116

casey-brooks · 2026-03-18T00:29:58Z

Summary

added cert-manager, trust-manager, and OpenZiti controller Helm releases in the system stack
introduced Istio TLS passthrough gateway/virtual services for Ziti endpoints
created the ziti stack (router, identities, services, policies) and updated apply order

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T00:33:46Z

Summary

set explicit kubeconfig path for the Ziti admin secret lookup in apply.sh

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T01:21:16Z

Summary

point the ziti provider host at the management API path via ingress
add wait loop for controller admin secret before running the ziti stack

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T01:42:52Z

Summary

advertise the controller management API on ziti-mgmt and pass through to the mgmt service
add TLS passthrough routing for ziti-mgmt and point the ziti provider at the mgmt hostname

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T01:49:22Z

Summary

added a wait loop in apply.sh to poll the Ziti management API readiness before running the ziti stack

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T01:57:12Z

Summary

added debug auth curl call and enabled TF_LOG=DEBUG around the ziti stack apply

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T02:07:26Z

Summary

updated Ziti passthrough gateway/virtual services to match on port 443 for Envoy
removed temporary auth debug + TF_LOG toggles from apply.sh

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T02:24:12Z

Summary

switched the ziti provider to connect over a localhost port-forward and added the port-forward lifecycle in apply.sh
replaced the platform-gateway wildcard host with explicit service hostnames to avoid SNI conflicts

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T02:32:43Z

Summary

updated the Ziti management API port-forward to detect the service port dynamically before mapping to local 1281

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T02:54:50Z

Summary

store Ziti enrollment identity secrets in the ziti namespace to avoid failures when the platform namespace is missing
increased ziti-router Helm release timeout to 600s

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T03:19:19Z

Summary

updated ziti-router ctrl endpoint to use the controller service port (2496)

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T03:43:12Z

Summary

capture ziti stack failures and dump router/controller diagnostics in apply.sh

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

rowan-stein · 2026-03-18T04:10:25Z

CI is green ✅ — full pipeline passes in ~22 minutes.

Requesting review. All stacks deploy end-to-end: k8s → system → routing → ziti → data → platform.

noa-lucent

Good work overall — the stack structure is clean, the Helm configs match the architecture spec, and the iterative debugging (visible in the commit history) resulted in a working deployment. The TLS passthrough approach is correct and the policy/identity model aligns well with the issue requirements.

One major issue must be addressed before merging: the Management API readiness loop in apply.sh silently falls through on timeout, which will produce confusing downstream failures. The remaining comments are minor cleanups and consistency suggestions.

Summary:

1 major: missing timeout failure for Management API readiness check
5 minor: repeated kubeconfig path, unused variable, unused remote_state, explicit hostname list documentation, brittle port-forward readiness
2 nit: inconsistent gateway reference style, ctrl endpoint port documentation

apply.sh

stacks/routing/main.tf

stacks/ziti/variables.tf

stacks/ziti/remote_state.tf

stacks/ziti/main.tf

casey-brooks · 2026-03-18T04:22:09Z

Summary

added explicit management API readiness failures, kubeconfig reuse, and port-forward polling in apply.sh
documented explicit platform gateway host list and normalized ziti passthrough gateway references
cleaned up ziti stack inputs (removed unused var/remote state) and annotated router control port

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T04:47:05Z

Summary

switched Ziti management readiness to use a port-forward and removed the redundant port-forward block

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

noa-lucent

All 8 prior review comments have been verified against the actual code and resolved:

✅ Management API readiness timeout — now fails explicitly with error message and port-forward cleanup.
✅ Kubeconfig path duplication — extracted to KUBECONFIG_PATH variable, used consistently (including merge_kubeconfig).
✅ Port-forward readiness — replaced sleep 2 with a proper poll loop against 127.0.0.1:1281.
✅ Platform hostname list — extracted to local.platform_gateway_hosts with a clear maintenance comment.
✅ Gateway reference consistency — all three ziti VirtualServices now use the resource reference.
✅ Unused platform_namespace variable — removed.
✅ Unused system remote state — removed.
✅ ctrl endpoint port comment — added.

The readiness flow in apply.sh is now robust: service discovery → port-forward → poll-based readiness → fail-on-timeout. Clean work.
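The fail-on-timeout behavior described above can be sketched as a small polling helper. This is a minimal illustration, not the code from apply.sh: the function name wait_until, its argument order, and the wu_ variable prefix are assumptions made for this example.

```shell
# Hypothetical helper (not taken from apply.sh).
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds. Returns 0 on success and
# returns 1 with an error message once the deadline has passed, so the
# caller never silently falls through on timeout.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  while :; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      echo "ERROR: timed out after ${wu_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$wu_interval"
  done
}
```

A caller would typically wrap the probe and clean up on failure, for example `wait_until 120 2 curl -sk "$READY_URL" || { kill "$PF_PID"; exit 1; }`, where READY_URL and PF_PID are placeholders for the readiness endpoint and the port-forward process ID.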

rowan-stein · 2026-03-18T05:12:17Z

✅ Ready to merge

CI: green (15m49s full pipeline)
Review: approved by @noa-lucent (all 8 comments addressed)
Blocked on: code owner / maintainer approval (branch protection)

casey-brooks · 2026-03-18T10:15:02Z

Summary

added Ziti namespace health verification to the CI health check script with diagnostics

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T10:48:59Z

Summary

added CoreDNS rewrite rules for Ziti hostnames and restart hook after controller install
introduced the null provider and updated provider lockfile for CoreDNS restart

Testing

terraform -chdir=/workspace/bootstrap/stacks/system init
terraform -chdir=/workspace/bootstrap fmt -check -recursive

noa-lucent

The CoreDNS rewrite approach is solid — good use of kubernetes_config_map_v1_data with the NodeHosts lifecycle ignore, and the null_resource trigger on Corefile hash is the right pattern. The comment explaining why the rewrites exist (enrollment JWTs advertise external hostnames) is helpful.

One major issue with the health check: the Ziti namespace check runs once before the poll loop, so transient not-ready states during pod initialization will cause an immediate hard failure with no retry — unlike every other check in the script.

Summary:

1 major: Ziti health check needs retry tolerance (move into poll loop or add its own)
2 minor: duplicated jq filter, unquoted kubeconfig path in local-exec

Previously-approved files are unchanged.

.github/scripts/verify_platform_health.sh

stacks/system/main.tf

casey-brooks · 2026-03-18T11:16:46Z

Summary

moved Ziti health checks into the polling loop with pending conditions and crash-loop diagnostics
extracted jq helpers and quoted kubeconfig path in the CoreDNS restart command

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

noa-lucent

All 3 prior comments verified against the actual code and resolved:

✅ Ziti health check retry tolerance — moved inside the main poll loop. Terminal states (CrashLoopBackOff/ImagePull) fail immediately with dump_ziti_diagnostics; transient states (missing/not-ready pods) are added to outstanding and retried until TOTAL_TIMEOUT.
✅ Duplicated jq filters — extracted into jq_unhealthy_pods() and jq_crash_backoffs() helpers. Also consolidated the platform crash-backoff inline filter to use the same helper — nice cleanup.
✅ Unquoted kubeconfig path — now properly double-quoted in the local-exec command.

Previously-approved files unchanged. LGTM — ready to merge.
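The split between terminal and transient states described in the first item above can be sketched as follows. This is an assumption-laden illustration rather than the contents of verify_platform_health.sh: the function names, the es_ variable prefix, and the return-code convention are invented for the example, and "Ready" stands in for whatever the real script treats as healthy.

```shell
# Hypothetical classification of a pod state string (not from the real script).
# CrashLoopBackOff, ImagePullBackOff and ErrImagePull are Kubernetes waiting
# reasons that will not recover by simply waiting longer.
classify_pod_state() {
  case "$1" in
    Ready) echo ok ;;
    CrashLoopBackOff|ImagePullBackOff|ErrImagePull) echo terminal ;;
    *) echo transient ;;
  esac
}

# Evaluate one polling pass over a list of pod states.
# Returns 0 when every pod is ready, 1 when the caller should retry,
# and 2 when a terminal state means the check should fail immediately.
evaluate_states() {
  es_result=0
  for es_state in "$@"; do
    case "$(classify_pod_state "$es_state")" in
      terminal) return 2 ;;
      transient) es_result=1 ;;
    esac
  done
  return "$es_result"
}
```

Inside a poll loop, a return code of 1 would add the check to the outstanding list and continue until the overall timeout, while a return code of 2 would dump diagnostics and exit at once.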

casey-brooks · 2026-03-18T16:12:16Z

Summary

aligned Ziti VirtualService destinations and router endpoint with service ports
switched Ziti provider/apply.sh to use Istio management host (removed port-forward)
moved cert-manager/trust-manager/ziti-controller to ArgoCD Applications with wait gates and updated CoreDNS rewrite comments

Testing

terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T16:13:04Z

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T16:29:45Z

Summary

added deps stack with ArgoCD applications for cert-manager, trust-manager, and ziti-controller plus CoreDNS rewrite rules
removed Ziti dependency apps and CoreDNS rewrites from system stack and trimmed providers/outputs
wired deps stack into apply.sh and required app checks

Testing

terraform -chdir=/workspace/bootstrap/stacks/deps init
terraform -chdir=/workspace/bootstrap/stacks/system init
terraform -chdir=/workspace/bootstrap fmt -check -recursive

casey-brooks · 2026-03-18T16:30:08Z

Test & Lint Summary

terraform -chdir=/workspace/bootstrap/stacks/deps init
terraform -chdir=/workspace/bootstrap/stacks/system init
terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T17:28:19Z

Summary

rebased onto origin/main and resolved apply.sh to keep step_start/step_end markers around routing, deps, ziti, data, and platform
preserved ziti secret wait, management API readiness check, and diagnostics inside stack:ziti

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T17:45:53Z

Summary

registered Jetstack and OpenZiti Helm repositories with ArgoCD in deps stack
wired deps applications to depend on the repository resources

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T18:18:26Z

Summary

enabled server-side diff and ignore-differences for cert-manager webhook caBundle drift
added RespectIgnoreDifferences in cert-manager sync options

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T18:42:07Z

Summary

wrapped deps stack apply with ArgoCD diagnostics on failure (sync/health/conditions/resources + cert-manager pods/events)

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T19:07:32Z

Summary

set deps ArgoCD apps to wait=false and added apply-time polling for Synced/Healthy status with detailed diagnostics on timeout
preserved step_start/step_end wrapping for all stack sections while updating deps handling

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T19:37:37Z

Summary

downgraded cert-manager chart default to v1.16.5 to avoid Kubernetes 1.30 CRD requirements

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

rowan-stein · 2026-03-18T20:03:27Z

CI is now green ✅. Requesting review.

Summary of changes:

New stacks/deps stack deploying cert-manager, trust-manager, and ziti-controller as ArgoCD applications
cert-manager pinned to v1.16.5 (k8s 1.28 compatibility)
ServerSideApply + ServerSideDiff + caBundle ignoreDifferences for cert-manager CRD handling
Helm repos registered via argocd_repository resources
Shell-based polling with diagnostics in apply.sh (wait=false on ArgoCD apps, manual health checks)
VirtualService port mismatch fixed (uses local.ingress_port)
Hardcoded ports removed; router ctrl.endpoint uses local.ingress_port
Port-forward removed; ziti provider uses Istio VirtualService
CoreDNS rewrites trimmed to only ziti.agyn.dev (removed unnecessary ziti-mgmt and ziti-router)
CoreDNS restart removed (reload plugin handles it)
step_start/step_end timing wrappers restored after rebase

noa-lucent

The overall architecture is clean — the deps stack for ArgoCD-managed apps is a well-reasoned separation from the direct Helm installs in system, the TLS passthrough routing is correct, the Ziti provisioning (identities, services, policies) aligns with the spec, and the apply.sh orchestration is robust with proper readiness checks and failure diagnostics.

One major fix required:

jq operator precedence bug in jq_crash_backoffs: the // vs + precedence means initContainerStatuses is silently dropped when containerStatuses exists. Parenthesization fix is straightforward.

Two minor cleanups:

Unused chart version variables in stacks/system/variables.tf (with a misleadingly different cert_manager_chart_version default)
Unused local.ingress_port in stacks/system/remote_state.tf

.github/scripts/verify_platform_health.sh

stacks/system/variables.tf

stacks/system/remote_state.tf

casey-brooks · 2026-03-18T20:17:36Z

Summary

fixed jq operator precedence in crash backoff detection
removed unused cert-manager/trust-manager/ziti-controller variables from system stack
dropped unused ingress_port local from system remote state

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

casey-brooks · 2026-03-18T20:26:46Z

Summary

updated k3s version to v1.34.3-k3s1 and kubectl to v1.34.3
restored cert-manager chart default to v1.20.0 in deps

Test & Lint Summary

terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

noa-lucent

All 3 prior comments verified against actual code and resolved:

✅ jq operator precedence bug — jq_crash_backoffs now correctly parenthesizes each // independently: (($pod.status.containerStatuses // []) + ($pod.status.initContainerStatuses // [])).
✅ Unused chart version variables — all three removed from stacks/system/variables.tf.
✅ Unused local.ingress_port — removed from stacks/system/remote_state.tf.
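The precedence issue from the first item above can be reproduced with a small sample document. This sketch assumes jq is installed. The sample JSON and the function names buggy_names and fixed_names are made up for illustration, and the filters operate on `.` rather than a `$pod` variable. In jq, `//` binds more loosely than `+`, so the unparenthesized form is parsed as `A // ([] + B // [])` and returns A alone whenever A is present.

```shell
# Sample pod with one regular container and one init container (illustrative data).
pod_json='{"status":{"containerStatuses":[{"name":"app"}],"initContainerStatuses":[{"name":"init"}]}}'

# Buggy form: initContainerStatuses is dropped because containerStatuses exists.
buggy_names() {
  printf '%s' "$pod_json" |
    jq -c '[(.status.containerStatuses // [] + .status.initContainerStatuses // []) | .[].name]'
}

# Fixed form: each // is parenthesized before the two arrays are concatenated.
fixed_names() {
  printf '%s' "$pod_json" |
    jq -c '[((.status.containerStatuses // []) + (.status.initContainerStatuses // [])) | .[].name]'
}
```

Calling buggy_names prints `["app"]`, while fixed_names prints `["app","init"]`, which is why crash backoffs in init containers were invisible to the original filter.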

New changes reviewed:

k3s upgrade v1.28.4-k3s1 → v1.34.3-k3s1 and matching kubectl v1.28.7 → v1.34.3 — consistent.
cert-manager restored to v1.20.0 in stacks/deps/variables.tf — the k8s upgrade resolves the selectableFields incompatibility, so the latest version is usable now.

No new issues. LGTM.

casey-brooks requested a review from a team as a code owner March 18, 2026 00:29

noa-lucent requested changes Mar 18, 2026

noa-lucent previously approved these changes Mar 18, 2026

casey-brooks dismissed noa-lucent’s stale review via 1413c28 March 18, 2026 10:14

noa-lucent requested changes Mar 18, 2026

noa-lucent previously approved these changes Mar 18, 2026

casey-brooks dismissed noa-lucent’s stale review via bf700e3 March 18, 2026 16:12

casey-brooks and others added 11 commits March 18, 2026 17:24

fix(ziti): place enrollment secrets

366c6d3

fix(ziti): point router at svc port

8f69056

fix(ziti): address review feedback

32226ce

fix(ci): verify ziti health

467eee3

fix(system): add ziti dns rewrites

727e6fa

fix(ci): refine ziti health checks

6188426

fix(bootstrap): align ziti routing and apps

7ed8e35

feat(deps): add argocd deps stack

c257d09

fix(deps): trim coredns rewrites

9a763e7

fix(deps): enable ssa sync options

b5ac0f8

ci: retrigger pipeline

05ce33f

casey-brooks force-pushed the noa/issue-116 branch from f8490e3 to 05ce33f March 18, 2026 17:28

fix(deps): register helm repos

c1b2d4a

fix(deps): ignore cert-manager cabundle drift

03fbf13

fix(apply): add deps diagnostics

825fdc6

fix(deps): wait for argocd sync

7417f31

fix(deps): pin cert-manager v1.16.5

ef070fb

noa-lucent requested changes Mar 18, 2026

chore: address review feedback

9dbf4b0

chore: bump k8s and cert-manager

389d7df

noa-lucent approved these changes Mar 18, 2026
