feat: add OpenZiti bootstrap stack #117

Open
casey-brooks wants to merge 23 commits into main from noa/issue-116

Conversation

@casey-brooks
Contributor

Summary

  • add cert-manager, trust-manager, and OpenZiti controller Helm releases
  • add Istio TLS passthrough gateway and virtual services for Ziti endpoints
  • create the ziti stack for router, identities, services, and policies and update apply flow

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Issue

@casey-brooks casey-brooks requested a review from a team as a code owner March 18, 2026 00:29
@casey-brooks
Contributor Author

Summary

  • added cert-manager, trust-manager, and OpenZiti controller Helm releases in the system stack
  • introduced Istio TLS passthrough gateway/virtual services for Ziti endpoints
  • created the ziti stack (router, identities, services, policies) and updated apply order

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • set explicit kubeconfig path for the Ziti admin secret lookup in apply.sh

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • point the ziti provider host at the management API path via ingress
  • add wait loop for controller admin secret before running the ziti stack

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • advertise the controller management API on ziti-mgmt and pass through to the mgmt service
  • add TLS passthrough routing for ziti-mgmt and point the ziti provider at the mgmt hostname

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • added a wait loop in apply.sh to poll the Ziti management API readiness before running the ziti stack

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • added debug auth curl call and enabled TF_LOG=DEBUG around the ziti stack apply

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • updated Ziti passthrough gateway/virtual services to match on port 443 for Envoy
  • removed temporary auth debug + TF_LOG toggles from apply.sh

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • switched the ziti provider to connect over a localhost port-forward and added the port-forward lifecycle in apply.sh
  • replaced the platform-gateway wildcard host with explicit service hostnames to avoid SNI conflicts

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • updated the Ziti management API port-forward to detect the service port dynamically before mapping to local 1281
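
For illustration, dynamic port detection ahead of a port-forward could look like the sketch below. It is an outline under stated assumptions, not the code in apply.sh: the function names, the PORT_FORWARD_PID variable, and the choice of the first listed service port are all hypothetical.

```shell
# discover_service_port NAMESPACE SERVICE
# Prints the first port listed on the Service (assumption: that is the API port).
discover_service_port() {
  kubectl --namespace "$1" get service "$2" \
    --output jsonpath='{.spec.ports[0].port}'
}

# start_port_forward NAMESPACE SERVICE LOCAL_PORT
# Looks up the remote port, then forwards LOCAL_PORT to it in the background.
# The background PID is stored in PORT_FORWARD_PID (hypothetical variable).
start_port_forward() {
  local namespace="$1" service="$2" local_port="$3" remote_port
  remote_port="$(discover_service_port "$namespace" "$service")"
  if [ -z "$remote_port" ]; then
    echo "ERROR: could not detect a port for service/$service" >&2
    return 1
  fi
  kubectl --namespace "$namespace" port-forward "service/$service" \
    "${local_port}:${remote_port}" >/dev/null 2>&1 &
  PORT_FORWARD_PID=$!
}

# stop_port_forward
# Stops the background forward if one is running; safe to call more than once.
stop_port_forward() {
  if [ -n "${PORT_FORWARD_PID:-}" ]; then
    kill "$PORT_FORWARD_PID" 2>/dev/null || true
    wait "$PORT_FORWARD_PID" 2>/dev/null || true
    PORT_FORWARD_PID=""
  fi
}
```

A caller might run `start_port_forward "$NAMESPACE" "$SERVICE" 1281` and register `trap stop_port_forward EXIT` so the forward is torn down even when a later step fails. The namespace and service name are placeholders here.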

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • store Ziti enrollment identity secrets in the ziti namespace to avoid the missing platform namespace
  • increased ziti-router Helm release timeout to 600s

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • updated ziti-router ctrl endpoint to use the controller service port (2496)

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • capture ziti stack failures and dump router/controller diagnostics in apply.sh

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@rowan-stein
Collaborator

CI is green ✅ — full pipeline passes in ~22 minutes.

Requesting review. All stacks deploy end-to-end: k8s → system → routing → ziti → data → platform.

@noa-lucent noa-lucent left a comment

Good work overall — the stack structure is clean, the Helm configs match the architecture spec, and the iterative debugging (visible in the commit history) resulted in a working deployment. The TLS passthrough approach is correct and the policy/identity model aligns well with the issue requirements.

One major issue must be addressed before merging: the Management API readiness loop in apply.sh silently falls through on timeout, which will produce confusing downstream failures. The remaining comments are minor cleanups and consistency suggestions.

Summary:

  • 1 major: missing timeout failure for Management API readiness check
  • 4 minor: repeated kubeconfig path, unused variable/remote_state, explicit hostname list documentation, brittle port-forward readiness
  • 2 nit: inconsistent gateway reference style, ctrl endpoint port documentation
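
A minimal sketch of the fail-on-timeout pattern behind the major item, assuming apply.sh runs under a POSIX-style shell. The names wait_until and mgmt_api_ready are hypothetical, and the probe URL is supplied by the caller rather than taken from the project.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success and 1 on timeout,
# so the caller cannot silently fall through to the next step.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ERROR: timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# mgmt_api_ready URL (hypothetical probe)
# Succeeds as soon as the endpoint returns any HTTP response at all;
# connection errors and timeouts count as "not ready".
mgmt_api_ready() {
  curl --silent --insecure --max-time 5 --output /dev/null "$1"
}
```

With this shape, a call such as `wait_until 120 2 mgmt_api_ready "$MGMT_URL" || exit 1` stops the run at the readiness step instead of letting the ziti stack fail later with a less obvious error. `MGMT_URL` is a placeholder.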

@casey-brooks
Contributor Author

Summary

  • added explicit management API readiness failures, kubeconfig reuse, and port-forward polling in apply.sh
  • documented explicit platform gateway host list and normalized ziti passthrough gateway references
  • cleaned up ziti stack inputs (removed unused var/remote state) and annotated router control port

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • switched Ziti management readiness to use a port-forward and removed the redundant port-forward block

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

noa-lucent previously approved these changes Mar 18, 2026

@noa-lucent noa-lucent left a comment

All 8 prior review comments have been verified against the actual code and resolved:

  1. Management API readiness timeout — now fails explicitly with error message and port-forward cleanup.
  2. Kubeconfig path duplication — extracted to KUBECONFIG_PATH variable, used consistently (including merge_kubeconfig).
  3. Port-forward readiness — replaced sleep 2 with a proper poll loop against 127.0.0.1:1281.
  4. Platform hostname list — extracted to local.platform_gateway_hosts with a clear maintenance comment.
  5. Gateway reference consistency — all three ziti VirtualServices now use the resource reference.
  6. Unused platform_namespace variable — removed.
  7. Unused system remote state — removed.
  8. ctrl endpoint port comment — added.

The readiness flow in apply.sh is now robust: service discovery → port-forward → poll-based readiness → fail-on-timeout. Clean work.

@rowan-stein
Collaborator

Ready to merge

  • CI: green (15m49s full pipeline)
  • Review: approved by @noa-lucent (all 8 comments addressed)
  • Blocked on: code owner / maintainer approval (branch protection)

@casey-brooks
Contributor Author

Summary

  • added Ziti namespace health verification to the CI health check script with diagnostics

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Summary

  • added CoreDNS rewrite rules for Ziti hostnames and restart hook after controller install
  • introduced the null provider and updated provider lockfile for CoreDNS restart

Testing

  • terraform -chdir=/workspace/bootstrap/stacks/system init
  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@noa-lucent noa-lucent left a comment

The CoreDNS rewrite approach is solid — good use of kubernetes_config_map_v1_data with the NodeHosts lifecycle ignore, and the null_resource trigger on Corefile hash is the right pattern. The comment explaining why the rewrites exist (enrollment JWTs advertise external hostnames) is helpful.

One major issue with the health check: the Ziti namespace check runs once before the poll loop, so transient not-ready states during pod initialization will cause an immediate hard failure with no retry — unlike every other check in the script.

Summary:

  • 1 major: Ziti health check needs retry tolerance (move into poll loop or add its own)
  • 2 minor: duplicated jq filter, unquoted kubeconfig path in local-exec

Previously-approved files are unchanged.

@casey-brooks
Contributor Author

Summary

  • moved Ziti health checks into the polling loop with pending conditions and crash-loop diagnostics
  • extracted jq helpers and quoted kubeconfig path in the CoreDNS restart command

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

noa-lucent previously approved these changes Mar 18, 2026

@noa-lucent noa-lucent left a comment

All 3 prior comments verified against the actual code and resolved:

  1. Ziti health check retry tolerance — moved inside the main poll loop. Terminal states (CrashLoopBackOff/ImagePull) fail immediately with dump_ziti_diagnostics; transient states (missing/not-ready pods) are added to outstanding and retried until TOTAL_TIMEOUT.
  2. Duplicated jq filters — extracted into jq_unhealthy_pods() and jq_crash_backoffs() helpers. Also consolidated the platform crash-backoff inline filter to use the same helper — nice cleanup.
  3. Unquoted kubeconfig path — now properly double-quoted in the local-exec command.
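
The terminal-versus-transient split in item 1 can be sketched roughly as follows. This is not the script's actual code: classify_pod_state and poll_health are hypothetical names, and the state command is assumed to print one `name reason ready` line per pod, with `-` standing in when there is no waiting reason.

```shell
# classify_pod_state REASON READY
# Prints "terminal" for states that will not recover on their own,
# "ok" for ready pods, and "transient" for everything else.
classify_pod_state() {
  case "$1" in
    CrashLoopBackOff|ImagePullBackOff|ErrImagePull)
      echo terminal
      ;;
    *)
      if [ "$2" = "true" ]; then echo ok; else echo transient; fi
      ;;
  esac
}

# poll_health TOTAL_TIMEOUT INTERVAL STATE_COMMAND [ARGS...]
# Returns 0 when every pod is ok, 2 immediately on a terminal state,
# and 1 if transient states are still outstanding at TOTAL_TIMEOUT.
poll_health() {
  local total_timeout="$1" interval="$2"
  shift 2
  local deadline outstanding report name reason ready
  deadline=$(( $(date +%s) + total_timeout ))
  while :; do
    outstanding=""
    report="$("$@")"
    if [ -z "$report" ]; then
      # No pods yet is treated as transient, not as success.
      outstanding=" (no pods reported)"
    fi
    while read -r name reason ready; do
      [ -n "$name" ] || continue
      case "$(classify_pod_state "$reason" "$ready")" in
        terminal)
          echo "ERROR: $name is in terminal state $reason" >&2
          return 2
          ;;
        transient)
          outstanding="$outstanding $name"
          ;;
      esac
    done <<EOF
$report
EOF
    if [ -z "$outstanding" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ERROR: not ready after ${total_timeout}s:$outstanding" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

Keeping the classification in its own function makes the policy easy to read and to extend, and the distinct return codes let the caller decide when to dump diagnostics.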

Previously-approved files unchanged. LGTM — ready to merge.

@casey-brooks
Contributor Author

Summary

  • aligned Ziti VirtualService destinations and router endpoint with service ports
  • switched Ziti provider/apply.sh to use Istio management host (removed port-forward)
  • moved cert-manager/trust-manager/ziti-controller to ArgoCD Applications with wait gates and updated CoreDNS rewrite comments

Testing

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • added deps stack with ArgoCD applications for cert-manager, trust-manager, and ziti-controller plus CoreDNS rewrite rules
  • removed Ziti dependency apps and CoreDNS rewrites from system stack and trimmed providers/outputs
  • wired deps stack into apply.sh and required app checks

Testing

  • terraform -chdir=/workspace/bootstrap/stacks/deps init
  • terraform -chdir=/workspace/bootstrap/stacks/system init
  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

@casey-brooks
Contributor Author

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap/stacks/deps init
  • terraform -chdir=/workspace/bootstrap/stacks/system init
  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • rebased onto origin/main and resolved apply.sh to keep step_start/step_end markers around routing, deps, ziti, data, and platform
  • preserved ziti secret wait, management API readiness check, and diagnostics inside stack:ziti

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • registered Jetstack and OpenZiti Helm repositories with ArgoCD in deps stack
  • wired deps applications to depend on the repository resources

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • enabled server-side diff and ignore-differences for cert-manager webhook caBundle drift
  • added RespectIgnoreDifferences in cert-manager sync options

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • wrapped deps stack apply with ArgoCD diagnostics on failure (sync/health/conditions/resources + cert-manager pods/events)

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • set deps ArgoCD apps to wait=false and added apply-time polling for Synced/Healthy status with detailed diagnostics on timeout
  • preserved step_start/step_end wrapping for all stack sections while updating deps handling
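
As a rough illustration of the apply-time check, the Synced/Healthy test and a timeout diagnostic could be expressed as below. This is a sketch only: the function names are hypothetical, and it assumes the ArgoCD Application resources are readable with kubectl in the namespace passed in.

```shell
# argocd_app_status NAMESPACE APP
# Prints "<sync status>/<health status>", for example "Synced/Healthy".
argocd_app_status() {
  kubectl --namespace "$1" get applications.argoproj.io "$2" \
    --output jsonpath='{.status.sync.status}/{.status.health.status}'
}

# argocd_app_ready NAMESPACE APP
# Succeeds only when the application is both Synced and Healthy.
argocd_app_ready() {
  [ "$(argocd_app_status "$1" "$2" 2>/dev/null)" = "Synced/Healthy" ]
}

# dump_argocd_diagnostics NAMESPACE APP
# Writes the last observed status and any reported conditions to stderr.
dump_argocd_diagnostics() {
  echo "--- ArgoCD application $2: $(argocd_app_status "$1" "$2" 2>&1)" >&2
  kubectl --namespace "$1" get applications.argoproj.io "$2" \
    --output jsonpath='{range .status.conditions[*]}{.type}: {.message}{"\n"}{end}' >&2 || true
}
```

A polling loop would call `argocd_app_ready` on each pass and `dump_argocd_diagnostics` once the deadline is reached, which keeps the wait logic in the shell script while the ArgoCD apps themselves are created with wait=false.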

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • downgraded cert-manager chart default to v1.16.5 to avoid Kubernetes 1.30 CRD requirements

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@rowan-stein
Collaborator

CI is now green ✅. Requesting review.

Summary of changes:

  • New stacks/deps stack deploying cert-manager, trust-manager, and ziti-controller as ArgoCD applications
  • cert-manager pinned to v1.16.5 (k8s 1.28 compatibility)
  • ServerSideApply + ServerSideDiff + caBundle ignoreDifferences for cert-manager CRD handling
  • Helm repos registered via argocd_repository resources
  • Shell-based polling with diagnostics in apply.sh (wait=false on ArgoCD apps, manual health checks)
  • VirtualService port mismatch fixed (uses local.ingress_port)
  • Hardcoded ports removed; router ctrl.endpoint uses local.ingress_port
  • Port-forward removed; ziti provider uses Istio VirtualService
  • CoreDNS rewrites trimmed to only ziti.agyn.dev (removed unnecessary ziti-mgmt and ziti-router)
  • CoreDNS restart removed (reload plugin handles it)
  • step_start/step_end timing wrappers restored after rebase

@noa-lucent noa-lucent left a comment

The overall architecture is clean — the deps stack for ArgoCD-managed apps is a well-reasoned separation from the direct Helm installs in system, the TLS passthrough routing is correct, the Ziti provisioning (identities, services, policies) aligns with the spec, and the apply.sh orchestration is robust with proper readiness checks and failure diagnostics.

One major fix required:

  • jq operator precedence bug in jq_crash_backoffs: the // vs + precedence means initContainerStatuses is silently dropped when containerStatuses exists. Parenthesization fix is straightforward.

Two minor cleanups:

  • Unused chart version variables in stacks/system/variables.tf (with a misleadingly different cert_manager_chart_version default)
  • Unused local.ingress_port in stacks/system/remote_state.tf
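
The precedence issue can be reproduced in isolation. In the sketch below, the buggy filter is a reconstruction of the pattern described, not a copy of the script, and both function names are hypothetical. Because `//` binds more loosely than `+` in jq, the unparenthesized form parses as `a // (([] + b) // [])`.

```shell
# statuses_buggy: reads a pod JSON object on stdin.
# When containerStatuses is present, the alternative operator returns it
# as-is and the init container statuses are never appended.
statuses_buggy() {
  jq -c '.status.containerStatuses // [] + .status.initContainerStatuses // []'
}

# statuses_fixed: each default is applied first, then the arrays are joined.
statuses_fixed() {
  jq -c '(.status.containerStatuses // []) + (.status.initContainerStatuses // [])'
}
```

For a pod with one regular container named `app` and one init container named `init`, `statuses_buggy` prints `[{"name":"app"}]` while `statuses_fixed` prints `[{"name":"app"},{"name":"init"}]`, so a crash-looping init container is only visible to the second form.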

@casey-brooks
Contributor Author

Summary

  • fixed jq operator precedence in crash backoff detection
  • removed unused cert-manager/trust-manager/ziti-controller variables from system stack
  • dropped unused ingress_port local from system remote state

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@casey-brooks
Contributor Author

Summary

  • updated k3s version to v1.34.3-k3s1 and kubectl to v1.34.3
  • restored cert-manager chart default to v1.20.0 in deps

Test & Lint Summary

  • terraform -chdir=/workspace/bootstrap fmt -check -recursive

Tests: 0 passed, 0 failed, 0 skipped (not run)
Lint: no errors

@noa-lucent noa-lucent left a comment

All 3 prior comments verified against actual code and resolved:

  1. jq operator precedence bug (jq_crash_backoffs): each // is now correctly parenthesized independently: (($pod.status.containerStatuses // []) + ($pod.status.initContainerStatuses // [])).
  2. Unused chart version variables — all three removed from stacks/system/variables.tf.
  3. Unused local.ingress_port — removed from stacks/system/remote_state.tf.

New changes reviewed:

  • k3s upgrade v1.28.4-k3s1 → v1.34.3-k3s1 and matching kubectl v1.28.7 → v1.34.3; the versions are consistent.
  • cert-manager restored to v1.20.0 in stacks/deps/variables.tf — the k8s upgrade resolves the selectableFields incompatibility, so the latest version is usable now.

No new issues. LGTM.
