henrypark133 commented Jan 15, 2026

High Availability Architecture: Introduces an Nginx load balancer and dynamic Docker Compose logic to support running multiple VPC API servers in parallel (toggled via DSTACK_VPC_HA_MODE).
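
A rough sketch of how that toggle could drive compose generation; the replica count, service name, and output file here are illustrative placeholders, not the PR's actual script. Only DSTACK_VPC_HA_MODE is the real variable:

    #!/usr/bin/env bash
    # Illustrative sketch: emit N API server services when HA mode is on.
    set -euo pipefail

    REPLICAS=1
    if [ "${DSTACK_VPC_HA_MODE:-false}" = "true" ]; then
      REPLICAS=3
    fi

    for i in $(seq 1 "$REPLICAS"); do
      {
        echo "  vpc-api-$i:"
        echo "    image: vpc-api:latest"
        echo "    restart: unless-stopped"
      } >> docker-compose.generated.yml
    done

Nginx then balances across whichever vpc-api-N services were generated.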

Database Persistence & S3 Replication: Integrates Litestream to replicate the Headscale SQLite database to S3 in real time and implements automated backup/restore of the noise_private.key to preserve server identity across redeployments.
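
The usual Litestream pattern is restore-if-a-replica-exists before the service starts, then replicate for the service's lifetime. A minimal sketch, assuming the default Headscale DB path; the replica URL reuses the test bucket from the Testing section below, and the exact entrypoint in the PR may differ:

    #!/usr/bin/env bash
    # Sketch of a restore-then-replicate entrypoint; the DB path and
    # replica URL are assumptions, not the PR's exact configuration.
    set -euo pipefail

    DB=/var/lib/headscale/db.sqlite
    REPLICA=s3://nearai-vpc-headscale-backups/test-headscale/db.sqlite

    # On a fresh CVM, pull the latest generation down from S3 first.
    litestream restore -if-db-not-exists -if-replica-exists -o "$DB" "$REPLICA"

    # Then replicate continuously while headscale runs as the child process.
    exec litestream replicate -exec "headscale serve" "$DB" "$REPLICA"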

Self-Healing Client Nodes: Upgrades node scripts with intelligent retry logic, connection timeouts, and an auto-recovery mechanism that automatically triggers re-registration if VPN authentication fails.
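
The recovery loop is roughly shaped like this; the retry count, timeouts, and backoff values are placeholders, since the node scripts themselves are not shown in this excerpt:

    #!/usr/bin/env bash
    # Sketch of the self-healing client loop; values are illustrative.
    set -euo pipefail

    for attempt in 1 2 3 4 5; do
      # --timeout bounds how long "tailscale up" waits before failing.
      if tailscale up --login-server "$HEADSCALE_URL" --authkey "$AUTH_KEY" --timeout 30s; then
        echo "VPN up on attempt $attempt"
        exit 0
      fi
      echo "Attempt $attempt failed; backing off..."
      sleep $((attempt * 10))
    done

    # Repeated auth failures usually mean a stale node key: re-register.
    echo "All retries failed; triggering re-registration"
    tailscale logout || true
    exec tailscale up --login-server "$HEADSCALE_URL" --authkey "$AUTH_KEY" --force-reauth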

Operational Stability: Implements container restart limits to prevent infinite crash loops, updates health checks to support the new load balancer, and bumps core dependencies (Docker, OpenSSL) to the latest versions.
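
In docker run terms the restart cap looks like this; the limit of 5 and the service name are illustrative, and compose supports the same on-failure:N policy:

    # Illustrative: stop retrying after 5 failed starts instead of
    # crash-looping forever.
    docker run -d --name vpc-api --restart=on-failure:5 vpc-api:latest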

https://www.notion.so/jasnahcom/DStack-VPC-High-Availability-Architecture-Changes-2e229a6526bf80a38ea9e5aaef7cbd1a

Testing

Test Environment

  • VPC Server: cpu01, port 2222
  • Test Node A: cpu02, port 2224
  • Test Node B: cpu02, port 2225
  • S3 Bucket: nearai-vpc-headscale-backups/test-headscale/

Scenario 1: Initial Deployment & Backup

  • ✅ VPC server deployed with S3 backup enabled
  • ✅ Litestream creates generations/ folder in S3
  • ✅ noise_private.key (72 bytes) uploaded to S3 after headscale starts
  • ✅ test-node-a registered successfully (IP: 100.128.0.1)

Scenario 2: VPC Server Redeploy (Fresh CVM)

  • ✅ Litestream restores DB from S3 before headscale starts
  • ✅ noise_private.key restored from S3
  • ✅ test-node-a preserved in headscale nodes list (from restored DB)
  • ✅ test-node-a auto-reconnects without manual intervention

Scenario 3: New Node After Restore

  • ✅ test-node-b registered successfully (IP: 100.128.0.2)
  • ✅ P2P connectivity: node-a → node-b (3ms)
  • ✅ P2P connectivity: node-b → node-a (2ms)
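
The round-trip figures above were presumably gathered with something like tailscale's own ping against the assigned tailnet IPs; this exact invocation is an assumption:

    # Hypothetical check from node-a: pings node-b's tailnet IP and
    # reports whether the path is direct (P2P) or relayed via DERP.
    tailscale ping -c 3 100.128.0.2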

…ups of sqlite, backup of noise private key, and more robust node configurations
# Upload the noise key once; keep any copy already present in S3.
if aws s3 ls "$S3_PATH" >/dev/null 2>&1; then
  echo "noise_private.key already exists in S3, skipping upload"
elif [ -f "$KEY_PATH" ]; then
  echo "Uploading noise_private.key to S3..."
  aws s3 cp "$KEY_PATH" "$S3_PATH"
fi
Collaborator


Ideally we want to encrypt the key, to avoid us being able to retrieve it from S3 and run headscale outside a CVM.

Collaborator


E.g. using the disk key at /dstack/.host-shared/.appkeys.json | jq .disk_crypt_key
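
A sketch of that suggestion; the openssl invocation and output path are illustrative, and as the reply below notes, the disk key's stability across redeployments is the sticking point:

    # Illustrative only: derive a passphrase from the CVM disk key and
    # encrypt the noise key before it ever leaves the CVM for S3.
    DISK_KEY=$(jq -r .disk_crypt_key /dstack/.host-shared/.appkeys.json)

    openssl enc -aes-256-cbc -pbkdf2 \
      -pass pass:"$DISK_KEY" \
      -in /var/lib/headscale/noise_private.key \
      -out /tmp/noise_private.key.enc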

Author


Hi Evrard,
I just experimented a bit. I don't think we should do this: the CVM configuration might (and in practice will) change with most deployments, so the disk key will change too and decryption will fail. I do think some form of encryption would be very useful here, but I think we can do that later.
