Skip to content

lifecycle: Extract auto-pause coordinator into cube-lifecycle-manager#705

Open
chenhengqi wants to merge 1 commit into
masterfrom
lifecycle-mgr
Open

lifecycle: Extract auto-pause coordinator into cube-lifecycle-manager#705
chenhengqi wants to merge 1 commit into
masterfrom
lifecycle-mgr

Conversation

@chenhengqi

Copy link
Copy Markdown
Collaborator

Split cube-proxy-sidecar out of the CubeProxy image into a standalone control-plane service (CLM), add Redis-backed service discovery so CubeProxy can run in multiple replicas, and wire it into one-click under cube-sandbox-control.target. Wire protocol, Redis schema, and SDK behavior are unchanged.

@cubesandboxbot

cubesandboxbot Bot commented Jul 2, 2026

Copy link
Copy Markdown

PR #705 Review: Extract auto-pause coordinator into cube-lifecycle-manager

Overall: Well-structured refactor. The move from in-container sidecar to standalone service is cleanly executed, the Fleet interface abstraction is sound, and the Redis-based service discovery design is architecturally correct. The one-click deployment wiring is thorough (systemd, compose, prepare/start/stop scripts).

Strengths

  • Clean abstractiondiscovery.Fleet interface lets production use Redis discovery while tests inject a static fleet, keeping test isolation without external dependencies.
  • Graceful degraded modes – Empty fleet returns empty results rather than errors; transient Redis failures are logged and retried; bootstrap errors don't block startup.
  • Single-writer discipline – Lua proxy_registry restricts heartbeats to worker 0, avoiding duplicate Redis writes in multi-worker nginx.
  • Defensive copiesSnapshot() returns fresh slices so callers can sort/filter without mutating internal state.
  • Upgrade compatibility – ConsumerGroup defaults to the legacy "cube-proxy-sidecar" name so an in-place upgrade doesn't reprocess history.
  • Thorough test for dynamic fleetTestClient_DynamicFleet verifies the crucial invariant that proxypush.Client re-reads Fleet.Snapshot() on every call.

Key Concerns

  1. Security: Admin API surface now exposed on node IP – The default CUBE_LCM_LISTEN_ADDR changed from 127.0.0.1:8083 to 0.0.0.0:8083, and CubeProxy's admin listener went from loopback-only to CUBE_SANDBOX_NODE_IP. The shared admin token (CUBE_ADMIN_TOKEN) is optional (empty → no auth). Consider making token auth mandatory when listening on non-loopback addresses, or at minimum emitting a warning log when auth is disabled on an externally-reachable address.

  2. Missing tests for RedisDiscovery (redis.go)discovery/redis.go (RedisDiscovery.refreshOnce, pruneExpired, Run) is the most critical new production code path but has zero tests. The existing tests cover only the helper functions (Static, decodeEndpoint, diffJoins) in discovery.go. The join/leave transition detection and the prune pipeline would benefit from table-driven tests.

  3. Lua Redis client re-created per heartbeat tickproxy_registry.lua creates a new Redis client object on every 5s tick (redis:new({...}) inside publish()). When authentication is configured, this means an AUTH round-trip every 5s. Consider hoisting the client creation into setup() and reconnecting only on error.

  4. Serial registry replay on proxy joinreplayRegistryTo pushes all sandbox entries serially via individual HTTP calls. For deployments with thousands of sandboxes, this could delay proxy readiness. Consider batching or a batch admin endpoint.

  5. Undocumented environment variables – ~20 new environment variables (CUBE_LCM_*, CUBE_PROXY_REGISTRY_*, CUBE_PROXY_ADMIN_LISTEN, etc.) are not documented in any markdown docs, making deployment harder for operators.

  6. PruneExpired race window – There's a small window between reading expired heartbeat IDs and executing the pipeline delete where a fresh heartbeat could have its registry entry erroneously cleaned up. Acknowledged as self-healing in comments; reducing window duration via Lua scripting or WATCH could be considered for production hardening.

Minor

  • broadcast() fans out to all endpoints concurrently with no goroutine limit – fine for typical fleet sizes but should be monitored.
  • env directives in nginx.conf (CUBE_PROXY_REGISTRY_REDIS_PASSWORD) will expose the Redis password in /proc/self/environ – acceptable in containerized deployments but worth noting for operators doing bare-metal.

-- also after any error observed on the previous tick — that way a
-- transient Redis restart never leaves us in a state where the heartbeat
-- says "alive" but the registry row is gone.
if not state.registry_pushed then

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registry_pushed guard creates a permanent desync scenario. If CLM's pruneExpired ever deletes this proxy's registry entry (e.g., heartbeat TTL expired and proxy recovered), registry_pushed stays true and the HSET is skipped forever. CLM sees a heartbeat ZSET member but HMGET returns nil, so the proxy is invisible until a Redis failure forces registry_pushed back to false.

HSET is idempotent — drop the guard and publish on every tick, or verify the entry exists before skipping.


-- publish is the single-tick worker. It (re)writes the registry entry the
-- first time it runs (or after a Redis outage), and always refreshes the
-- heartbeat score to now_ms. Every network operation is wrapped in pcall

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment claims every network operation is "wrapped in pcall" but the actual red:hset() and red:zadd() calls at lines 63 and 71 are bare calls with only return-value checking. If the redis_iresty library ever throws (not returns nil+err), the error propagates into ngx.timer.every which permanently stops calling publish() — the proxy silently stops heartbeating until nginx restart.

Comment thread cube-lifecycle-manager/cmd/cube-lifecycle-manager/main.go
discSvc *discovery.RedisDiscovery
staticFleet *discovery.Static
)
if len(cfg.CubeProxyAdminURLs) > 0 && cfg.UseStaticFleet {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting CUBE_LCM_USE_STATIC_FLEET=1 without CUBE_LCM_PROXY_ADMIN_URLS is silently ignored: the condition requires both len(urls) > 0 && UseStaticFleet. An operator enabling static mode but forgetting URLs gets Redis discovery instead, which may reach the wrong endpoints. Add a validation check in config.Validate().

// pruneExpired removes heartbeat members whose score is <= cutoff and drops
// the matching registry entries. Callers pass the same cutoff they used for
// the read so the window is consistent.
func (d *RedisDiscovery) pruneExpired(ctx context.Context, cutoff int64) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race window: if a proxy's heartbeat score is ≤cutoff at read time (ZRangeByScore line 207) but gets refreshed (ZADD) between this read and the pipeline execution (line 214-218), the ZREMRANGEBYSCORE won't remove the new heartbeat (score > cutoff) but the HDEL still targets this proxy_id from the stale expired list. Result: a live proxy's registry entry is deleted while its heartbeat is intact, making it invisible to CLM until the next Redis failure.

Consider a Lua script (EVAL) to atomically read, ZREMRANGEBYSCORE, and HDEL in one go.

@@ -129,8 +143,12 @@ func (s *Server) handleHealthz(w http.ResponseWriter, _ *http.Request) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleReadyz exposes registry_len and fleet_size. The CLM listener defaults to 0.0.0.0:8083. While these are not secrets, they leak cluster topology to any network peer. Consider either restricting /readyz to a separate loopback listener or documenting that operators should firewall port 8083.

if ldict then
hb_ms = ldict:get("cube_proxy_heartbeat_last_pushed_ms")
end
return reply(ngx.HTTP_OK, {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The healthz endpoint exposes free_space() for all three shared dicts (META, STATE, LAST). With the admin listener now potentially on the node IP, this leaks capacity metrics that could be used to profile the system. Consider moving diagnostic data to a separate /admin/stats endpoint and keeping /admin/healthz minimal ({"ok":true}).

Comment thread cube-lifecycle-manager/cmd/cube-lifecycle-manager/main.go Outdated
end

local redis = require "redis_iresty"
local red = redis:new({

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis cosocket not pooled — A new Redis connection is created on every heartbeat tick but set_keepalive() is never called, so the cosocket is closed at GC time rather than returned to the pool. At scale (100+ proxy replicas each heartbeating every 5s), this creates ~20 TCP handshakes/second to the Redis server unnecessarily.

Suggestion: Add red:set_keepalive(10000, 100) after the ZADD call (around line 76) to reuse the connection across ticks.

# node) can push meta / state updates here. Token auth via
# $cube_admin_token below prevents outside abuse; operators may further
# tighten this with iptables / security groups.
CUBE_PROXY_ADMIN_LISTEN="${CUBE_PROXY_ADMIN_LISTEN:-${CUBE_SANDBOX_NODE_IP}}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admin server now network-reachable — Changing from 127.0.0.1 (hardcoded in the old nginx.conf) to CUBE_SANDBOX_NODE_IP (line 42 default) means the admin server (port 8082) is now reachable across the network. The token auth ($cube_admin_token) defaults to empty. Consider adding a deploy-time check that exits with a clear error when CUBE_ADMIN_TOKEN is empty and the admin listen is not loopback.

// Replay the current registry snapshot to the newly-arrived
// proxy. We must not block the discovery refresh loop, so
// this runs in its own goroutine with a bounded context.
go replayRegistryTo(rootCtx, pushClient, reg, ep, logger.Named("replay"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thundering-herd on startup — This goroutine is spawned for each new proxy joining. When N proxies register in the same discovery cycle (guaranteed on cluster restart), this fires N goroutines. Each iterates over every sandbox in the registry and pushes entries one-at-a-time via individual HTTP POST calls. At 100 proxies x 10,000 sandboxes = 1,000,000 sequential HTTP calls, delaying metadata delivery for minutes.

Suggestion: Add a batch admin endpoint to CubeProxy (e.g., POST /admin/meta/batch accepting an array of entries) so replayRegistryTo sends all entries in O(1) HTTP calls per proxy. At minimum, limit concurrent replays with a semaphore.

CUBE_LCM_CUBEMASTER_URL="${CUBE_LCM_CUBEMASTER_URL:-http://${CUBE_SANDBOX_NODE_IP}:8089}"

# CLM listens on host network so CubeProxy on other nodes can reach it.
CUBE_LCM_LISTEN_ADDR="${CUBE_LCM_LISTEN_ADDR:-0.0.0.0:8083}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLM HTTP server on 0.0.0.0:8083 — The /internal/resume endpoint has no authentication, and binding to 0.0.0.0 makes it reachable from any network interface. Previously this was bound to 127.0.0.1 (sidecar model). Consider authenticating this endpoint with the same CUBE_ADMIN_TOKEN used for admin pushes, or at minimum changing the default to 127.0.0.1 when running in single-host mode.

// pushClient reads Fleet.Snapshot() on every call, so a later swap-in of
// the RedisDiscovery Fleet is picked up automatically. We construct the
// discovery instance below so its onJoin can reference pushClient.
var pushClient *proxypush.Client

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fragile closure capture of nil pushClientpushClient is declared nil on line 101 and captured by the OnJoin closure on line 113. It only works because discSvc.Run() (started later, line 182) fires after pushClient is assigned on line 123. A future refactoring that starts the discovery loop earlier (e.g., an initial sync inside New()) would produce a nil-pointer panic at runtime with no compile-time warning.

Suggestion: Add a SetOnJoin(func(Endpoint)) method to RedisDiscovery and wire the callback after pushClient is fully initialized on line 123.


// Redis key names. Kept as package-level vars so tests can point them at a
// scratch namespace if needed.
var (

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable package-level Redis keys — These exported vars can be mutated by any imported package. The comment says they're mutable so tests can use scratch namespaces, but this creates global state risk: concurrent test cases can interfere with each other, and a stray import could change production key names.

Suggestion: Accept key names as constructor parameters (discovery.New(o Options)), or make them unexported and provide a test-only setter.

-e 's|^\(\s*listen \)8080\( ssl reuseport;\)|\1__CUBE_PROXY_HTTPS_PORT__\2|' \
-e 's|^\(\s*set \$host_proxy_port \)8081;|\1__CUBE_PROXY_HTTP_PORT__;|' \
-e 's|^\(\s*set \$host_proxy_port \)8080;|\1__CUBE_PROXY_HTTPS_PORT__;|' \
-e 's|^\(\s*listen \)127\.0\.0\.1:8082;|\1__CUBE_PROXY_ADMIN_LISTEN__:8082;|' \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CUBE_PROXY_ADMIN_LISTEN is not 127.0.0.1, should we require CUBE_ADMIN_TOKEN forcibly?

-- auto-pause / auto-resume coordination dicts. See nginx.conf admin server
-- block for routing.
--
-- All requests are loopback-bound (the listen directive is 127.0.0.1:8082);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems to be inaccurate.

@chenhengqi chenhengqi force-pushed the lifecycle-mgr branch 2 times, most recently from 2614d4b to 6e0db02 Compare July 3, 2026 03:00
Split cube-proxy-sidecar out of the CubeProxy image into a standalone
control-plane service (CLM), add Redis-backed service discovery so
CubeProxy can run in multiple replicas, and wire it into one-click
under cube-sandbox-control.target. Wire protocol, Redis schema, and
SDK behavior are unchanged.

Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants