Stop rotating docker TLS CA on update #3803

Open
ntner wants to merge 2 commits into master from remove-mtls-cert-version-track

@ntner ntner commented Jun 11, 2026

Contributor

Summary

The rack regenerates the internal Docker TLS certificate authority on every rack version update. Running EC2 instances keep the CA they were issued at boot, so after an upgrade the rack API presents a client certificate signed by a CA those instances no longer trust. Commands that reach an instance's Docker daemon then fail until the instance is replaced:

ERROR: Get "https://<instance-ip>:2376/containers/json?...": remote error: tls: unknown certificate authority

This change makes certificate generation idempotent again: a version update preserves the existing CA and certificates, and regeneration happens only when a certificate is missing or close to expiry.

Background

The Docker daemon on each instance runs in mTLS mode (--tlscacert /etc/ca.pem --tlscert /etc/cert.pem --tlskey /etc/key.pem). The rack API connects to it with a client certificate. The CA, the client certificate, and the instance's /etc/ca.pem all originate from the same CloudFormation custom resource (DockertTLSCertGenerate, handled by provider/aws/lambda/formation/handler/certificate.go). mTLS only succeeds when the rack's client certificate chains to the CA that the target instance currently trusts.

UpdateSelfSignedCertsForDocker had been keyed on the rack version: it stored the version in an SSM -version-track parameter and regenerated a brand new CA whenever that value differed from the current version. Because the version changes on every release, the CA rotated on every update. The rack API picks up the new client certificate as soon as it restarts, but a running instance only writes /etc/ca.pem once, at boot, so it keeps the previous CA until it is replaced. That mismatch is what surfaced the error above for convox run and convox exec against not-yet-cycled instances. (Build instances were already replaced on every update by an earlier change, so they were unaffected.)
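The mismatch can be reproduced outside the rack with a minimal Go sketch. It is illustrative only and is not the handler code: the `authority` type and the `newCA`, `issueClient`, and `trustedBy` helpers are hypothetical names. A client certificate issued by a freshly generated CA verifies against that CA but not against the CA an instance loaded at boot.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority is a hypothetical helper type pairing a CA certificate with its key.
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

func newKey() *ecdsa.PrivateKey {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	return key
}

// sign creates a certificate from tmpl for pub, signed by parent using parentKey.
func sign(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, parentKey *ecdsa.PrivateKey) *x509.Certificate {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

// newCA generates a fresh self-signed CA, as a version-keyed update would.
func newCA(name string) authority {
	key := newKey()
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	return authority{cert: sign(tmpl, tmpl, &key.PublicKey, key), key: key}
}

// issueClient issues a client-auth certificate signed by ca.
func issueClient(ca authority) *x509.Certificate {
	key := newKey()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "rack-api"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	return sign(tmpl, ca.cert, &key.PublicKey, ca.key)
}

// trustedBy verifies client against a pool containing only ca,
// mirroring a daemon that trusts a single CA file.
func trustedBy(client *x509.Certificate, ca authority) error {
	pool := x509.NewCertPool()
	pool.AddCert(ca.cert)
	_, err := client.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

func main() {
	bootCA := newCA("ca-at-instance-boot")  // what a running instance trusts
	updatedCA := newCA("ca-after-update")   // what the rack rotated to
	client := issueClient(updatedCA)        // what the rack API now presents

	fmt.Println("cycled instance:", trustedBy(client, updatedCA))
	fmt.Println("not-yet-cycled instance:", trustedBy(client, bootCA))
}
```

The second check returns a non-nil verification error, which is the same condition the Docker daemon reports to the client as an unknown certificate authority during the TLS handshake.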

The version keying was introduced to force racks holding the original one-year certificates to reissue them after the certificate lifetime was extended to one hundred years. With one-hundred-year certificates that one-time migration is no longer needed, and tying regeneration to the version is what creates the rotation.

Change

provider/aws/lambda/formation/handler/certificate.go:

  • UpdateSelfSignedCertsForDocker reads the existing certificate parameter and returns it unchanged when it is present and more than two months from expiry. It regenerates only when the certificate is missing, unreadable, or near expiry. This restores the original (pre-version-track) idempotent behavior.
  • CreateSelfSignedCertsForDocker no longer writes the -version-track parameter, and the versionParameterName helper is removed.
  • A guard falls back to regeneration if the stored certificate cannot be PEM-decoded, so an unreadable parameter cannot fault the custom resource Lambda.
  • Certificate validity stays at one hundred years.
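
The decision described in the list above can be sketched roughly as follows. This is not the code in certificate.go: `needsRegeneration`, `renewalWindow`, and `selfSigned` are hypothetical names, "two months" is approximated as 60 days, and the SSM read is replaced by a plain byte slice.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// renewalWindow approximates "two months"; the exact value is an assumption.
const renewalWindow = 60 * 24 * time.Hour

// needsRegeneration reports whether a stored PEM certificate should be
// replaced: true when it is missing, unreadable, or inside the renewal window.
func needsRegeneration(stored []byte) bool {
	if len(stored) == 0 {
		return true // missing
	}
	block, _ := pem.Decode(stored)
	if block == nil || block.Type != "CERTIFICATE" {
		return true // not PEM: regenerate instead of faulting
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return true // PEM wrapper present but contents unreadable
	}
	// Keep the certificate only if it expires after now + renewalWindow.
	return !cert.NotAfter.After(time.Now().Add(renewalWindow))
}

// selfSigned builds a throwaway self-signed certificate valid for the given
// number of days. It exists only to exercise needsRegeneration.
func selfSigned(days int) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Duration(days) * 24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

func main() {
	fmt.Println("missing:", needsRegeneration(nil))
	fmt.Println("unreadable:", needsRegeneration([]byte("not a certificate")))
	fmt.Println("30 days left:", needsRegeneration(selfSigned(30)))
	fmt.Println("100 years left:", needsRegeneration(selfSigned(36500)))
}
```

Because the rack version plays no part in the check, repeated updates leave a healthy certificate untouched, and only the first three cases trigger a new CA.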
