lifecycle: Extract auto-pause coordinator into cube-lifecycle-manager by chenhengqi · Pull Request #705 · TencentCloud/CubeSandbox

chenhengqi · 2026-07-02T02:39:05Z

Split cube-proxy-sidecar out of the CubeProxy image into a standalone control-plane service (CLM), add Redis-backed service discovery so CubeProxy can run in multiple replicas, and wire it into one-click under cube-sandbox-control.target. Wire protocol, Redis schema, and SDK behavior are unchanged.

cubesandboxbot · 2026-07-02T02:45:14Z

PR #705 Review: Extract auto-pause coordinator into cube-lifecycle-manager

Overall: Well-structured refactor. The move from in-container sidecar to standalone service is cleanly executed, the Fleet interface abstraction is sound, and the Redis-based service discovery design is architecturally correct. The one-click deployment wiring is thorough (systemd, compose, prepare/start/stop scripts).

Strengths

Clean abstraction – discovery.Fleet interface lets production use Redis discovery while tests inject a static fleet, keeping test isolation without external dependencies.
Graceful degraded modes – Empty fleet returns empty results rather than errors; transient Redis failures are logged and retried; bootstrap errors don't block startup.
Single-writer discipline – Lua proxy_registry restricts heartbeats to worker 0, avoiding duplicate Redis writes in multi-worker nginx.
Defensive copies – Snapshot() returns fresh slices so callers can sort/filter without mutating internal state.
Upgrade compatibility – ConsumerGroup defaults to the legacy "cube-proxy-sidecar" name so an in-place upgrade doesn't reprocess history.
Thorough test for dynamic fleet – TestClient_DynamicFleet verifies the crucial invariant that proxypush.Client re-reads Fleet.Snapshot() on every call.

Key Concerns

Security: Admin API surface now exposed on node IP – The default CUBE_LCM_LISTEN_ADDR changed from 127.0.0.1:8083 to 0.0.0.0:8083, and CubeProxy's admin listener went from loopback-only to CUBE_SANDBOX_NODE_IP. The shared admin token (CUBE_ADMIN_TOKEN) is optional (empty → no auth). Consider making token auth mandatory when listening on non-loopback addresses, or at minimum emitting a warning log when auth is disabled on an externally-reachable address.
Missing tests for RedisDiscovery (redis.go) – discovery/redis.go (RedisDiscovery.refreshOnce, pruneExpired, Run) is the most critical new production code path but has zero tests. The existing tests cover only the helper functions (Static, decodeEndpoint, diffJoins) in discovery.go. The join/leave transition detection and the prune pipeline would benefit from table-driven tests.
Lua Redis client re-created per heartbeat tick – proxy_registry.lua creates a new Redis client object on every 5s tick (redis:new({...}) inside publish()). When authentication is configured, this means an AUTH round-trip every 5s. Consider hoisting the client creation into setup() and reconnecting only on error.
Serial registry replay on proxy join – replayRegistryTo pushes all sandbox entries serially via individual HTTP calls. For deployments with thousands of sandboxes, this could delay proxy readiness. Consider batching or a batch admin endpoint.
Undocumented environment variables – ~20 new environment variables (CUBE_LCM_*, CUBE_PROXY_REGISTRY_*, CUBE_PROXY_ADMIN_LISTEN, etc.) are not documented in any markdown docs, making deployment harder for operators.
PruneExpired race window – There's a small window between reading expired heartbeat IDs and executing the pipeline delete where a fresh heartbeat could have its registry entry erroneously cleaned up. Acknowledged as self-healing in comments; reducing window duration via Lua scripting or WATCH could be considered for production hardening.

Minor

broadcast() fans out to all endpoints concurrently with no goroutine limit – fine for typical fleet sizes but should be monitored.
env directives in nginx.conf (CUBE_PROXY_REGISTRY_REDIS_PASSWORD) will expose the Redis password in /proc/self/environ – acceptable in containerized deployments but worth noting for operators doing bare-metal.

cubesandboxbot · 2026-07-02T02:45:23Z

+    -- also after any error observed on the previous tick — that way a
+    -- transient Redis restart never leaves us in a state where the heartbeat
+    -- says "alive" but the registry row is gone.
+    if not state.registry_pushed then


The registry_pushed guard creates a permanent desync scenario. If CLM's pruneExpired ever deletes this proxy's registry entry (e.g., heartbeat TTL expired and proxy recovered), registry_pushed stays true and the HSET is skipped forever. CLM sees a heartbeat ZSET member but HMGET returns nil, so the proxy is invisible until a Redis failure forces registry_pushed back to false.

HSET is idempotent — drop the guard and publish on every tick, or verify the entry exists before skipping.

cubesandboxbot · 2026-07-02T02:45:24Z

+
+-- publish is the single-tick worker. It (re)writes the registry entry the
+-- first time it runs (or after a Redis outage), and always refreshes the
+-- heartbeat score to now_ms. Every network operation is wrapped in pcall


Comment claims every network operation is "wrapped in pcall" but the actual red:hset() and red:zadd() calls at lines 63 and 71 are bare calls with only return-value checking. If the redis_iresty library ever throws (not returns nil+err), the error propagates into ngx.timer.every which permanently stops calling publish() — the proxy silently stops heartbeating until nginx restart.

cubesandboxbot · 2026-07-02T02:45:25Z

+		discSvc       *discovery.RedisDiscovery
+		staticFleet   *discovery.Static
+	)
+	if len(cfg.CubeProxyAdminURLs) > 0 && cfg.UseStaticFleet {


Setting CUBE_LCM_USE_STATIC_FLEET=1 without CUBE_LCM_PROXY_ADMIN_URLS is silently ignored: the condition requires both len(urls) > 0 && UseStaticFleet. An operator enabling static mode but forgetting URLs gets Redis discovery instead, which may reach the wrong endpoints. Add a validation check in config.Validate().

cubesandboxbot · 2026-07-02T02:45:26Z

+// pruneExpired removes heartbeat members whose score is <= cutoff and drops
+// the matching registry entries. Callers pass the same cutoff they used for
+// the read so the window is consistent.
+func (d *RedisDiscovery) pruneExpired(ctx context.Context, cutoff int64) error {


Race window: if a proxy's heartbeat score is ≤cutoff at read time (ZRangeByScore line 207) but gets refreshed (ZADD) between this read and the pipeline execution (line 214-218), the ZREMRANGEBYSCORE won't remove the new heartbeat (score > cutoff) but the HDEL still targets this proxy_id from the stale expired list. Result: a live proxy's registry entry is deleted while its heartbeat is intact, making it invisible to CLM until the next Redis failure.

Consider a Lua script (EVAL) to atomically read, ZREMRANGEBYSCORE, and HDEL in one go.

cubesandboxbot · 2026-07-02T02:45:27Z

@@ -129,8 +143,12 @@ func (s *Server) handleHealthz(w http.ResponseWriter, _ *http.Request) {



handleReadyz exposes registry_len and fleet_size. The CLM listener defaults to 0.0.0.0:8083. While these are not secrets, they leak cluster topology to any network peer. Consider either restricting /readyz to a separate loopback listener or documenting that operators should firewall port 8083.

cubesandboxbot · 2026-07-02T02:45:27Z

+    if ldict then
+        hb_ms = ldict:get("cube_proxy_heartbeat_last_pushed_ms")
+    end
    return reply(ngx.HTTP_OK, {


The healthz endpoint exposes free_space() for all three shared dicts (META, STATE, LAST). With the admin listener now potentially on the node IP, this leaks capacity metrics that could be used to profile the system. Consider moving diagnostic data to a separate /admin/stats endpoint and keeping /admin/healthz minimal ({"ok":true}).

cubesandboxbot · 2026-07-02T04:20:53Z

+    end
+
+    local redis = require "redis_iresty"
+    local red = redis:new({


Redis cosocket not pooled — A new Redis connection is created on every heartbeat tick but set_keepalive() is never called, so the cosocket is closed at GC time rather than returned to the pool. At scale (100+ proxy replicas each heartbeating every 5s), this creates ~20 TCP handshakes/second to the Redis server unnecessarily.

Suggestion: Add red:set_keepalive(10000, 100) after the ZADD call (around line 76) to reuse the connection across ticks.

cubesandboxbot · 2026-07-02T04:21:17Z

+# node) can push meta / state updates here. Token auth via
+# $cube_admin_token below prevents outside abuse; operators may further
+# tighten this with iptables / security groups.
+CUBE_PROXY_ADMIN_LISTEN="${CUBE_PROXY_ADMIN_LISTEN:-${CUBE_SANDBOX_NODE_IP}}"


Admin server now network-reachable — Changing from 127.0.0.1 (hardcoded in the old nginx.conf) to CUBE_SANDBOX_NODE_IP (line 42 default) means the admin server (port 8082) is now reachable across the network. The token auth ($cube_admin_token) defaults to empty. Consider adding a deploy-time check that exits with a clear error when CUBE_ADMIN_TOKEN is empty and the admin listen is not loopback.

cubesandboxbot · 2026-07-02T04:21:25Z

+				// Replay the current registry snapshot to the newly-arrived
+				// proxy. We must not block the discovery refresh loop, so
+				// this runs in its own goroutine with a bounded context.
+				go replayRegistryTo(rootCtx, pushClient, reg, ep, logger.Named("replay"))


Thundering-herd on startup — This goroutine is spawned for each new proxy joining. When N proxies register in the same discovery cycle (guaranteed on cluster restart), this fires N goroutines. Each iterates over every sandbox in the registry and pushes entries one-at-a-time via individual HTTP POST calls. At 100 proxies x 10,000 sandboxes = 1,000,000 sequential HTTP calls, delaying metadata delivery for minutes.

Suggestion: Add a batch admin endpoint to CubeProxy (e.g., POST /admin/meta/batch accepting an array of entries) so replayRegistryTo sends all entries in O(1) HTTP calls per proxy. At minimum, limit concurrent replays with a semaphore.

cubesandboxbot · 2026-07-02T04:21:41Z

+CUBE_LCM_CUBEMASTER_URL="${CUBE_LCM_CUBEMASTER_URL:-http://${CUBE_SANDBOX_NODE_IP}:8089}"
+
+# CLM listens on host network so CubeProxy on other nodes can reach it.
+CUBE_LCM_LISTEN_ADDR="${CUBE_LCM_LISTEN_ADDR:-0.0.0.0:8083}"


CLM HTTP server on 0.0.0.0:8083 — The /internal/resume endpoint has no authentication, and binding to 0.0.0.0 makes it reachable from any network interface. Previously this was bound to 127.0.0.1 (sidecar model). Consider authenticating this endpoint with the same CUBE_ADMIN_TOKEN used for admin pushes, or at minimum changing the default to 127.0.0.1 when running in single-host mode.

cubesandboxbot · 2026-07-02T04:21:52Z

+	// pushClient reads Fleet.Snapshot() on every call, so a later swap-in of
+	// the RedisDiscovery Fleet is picked up automatically. We construct the
+	// discovery instance below so its onJoin can reference pushClient.
+	var pushClient *proxypush.Client


Fragile closure capture of nil pushClient — pushClient is declared nil on line 101 and captured by the OnJoin closure on line 113. It only works because discSvc.Run() (started later, line 182) fires after pushClient is assigned on line 123. A future refactoring that starts the discovery loop earlier (e.g., an initial sync inside New()) would produce a nil-pointer panic at runtime with no compile-time warning.

Suggestion: Add a SetOnJoin(func(Endpoint)) method to RedisDiscovery and wire the callback after pushClient is fully initialized on line 123.

cubesandboxbot · 2026-07-02T04:21:52Z

+
+// Redis key names. Kept as package-level vars so tests can point them at a
+// scratch namespace if needed.
+var (


Mutable package-level Redis keys — These exported vars can be mutated by any imported package. The comment says they're mutable so tests can use scratch namespaces, but this creates global state risk: concurrent test cases can interfere with each other, and a stray import could change production key names.

Suggestion: Accept key names as constructor parameters (discovery.New(o Options)), or make them unexported and provide a test-only setter.

chengjoey · 2026-07-02T06:38:42Z

      -e 's|^\(\s*listen \)8080\( ssl reuseport;\)|\1__CUBE_PROXY_HTTPS_PORT__\2|' \
      -e 's|^\(\s*set \$host_proxy_port \)8081;|\1__CUBE_PROXY_HTTP_PORT__;|' \
      -e 's|^\(\s*set \$host_proxy_port \)8080;|\1__CUBE_PROXY_HTTPS_PORT__;|' \
+      -e 's|^\(\s*listen \)127\.0\.0\.1:8082;|\1__CUBE_PROXY_ADMIN_LISTEN__:8082;|' \


If CUBE_PROXY_ADMIN_LISTEN is not 127.0.0.1, should we require CUBE_ADMIN_TOKEN forcibly?

chengjoey · 2026-07-02T06:39:31Z

 -- auto-pause / auto-resume coordination dicts. See nginx.conf admin server
 -- block for routing.
 --
 -- All requests are loopback-bound (the listen directive is 127.0.0.1:8082);


This comment seems to be inaccurate.

Split cube-proxy-sidecar out of the CubeProxy image into a standalone control-plane service (CLM), add Redis-backed service discovery so CubeProxy can run in multiple replicas, and wire it into one-click under cube-sandbox-control.target. Wire protocol, Redis schema, and SDK behavior are unchanged. Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>

chenhengqi requested review from fslongjin, ls-ggg, tinklone and up2wing as code owners July 2, 2026 02:39

cubesandboxbot Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread cube-lifecycle-manager/cmd/cube-lifecycle-manager/main.go

cubesandboxbot Bot reviewed Jul 2, 2026

View reviewed changes

chengjoey reviewed Jul 2, 2026

View reviewed changes

Comment thread cube-lifecycle-manager/cmd/cube-lifecycle-manager/main.go Outdated

chenhengqi force-pushed the lifecycle-mgr branch from 1ed5c6b to 95333fa Compare July 2, 2026 03:57

cubesandboxbot Bot reviewed Jul 2, 2026

View reviewed changes

chengjoey reviewed Jul 2, 2026

View reviewed changes

chenhengqi force-pushed the lifecycle-mgr branch 2 times, most recently from 2614d4b to 6e0db02 Compare July 3, 2026 03:00

chenhengqi force-pushed the lifecycle-mgr branch from 6e0db02 to 56f95af Compare July 3, 2026 03:02

chenhengqi mentioned this pull request Jul 3, 2026

feat(cubeproxy): add plaintext gRPC ingress on port 9090 #679

Closed

		@@ -129,8 +143,12 @@ func (s Server) handleHealthz(w http.ResponseWriter, _ http.Request) {

Uh oh!

Conversation

chenhengqi commented Jul 2, 2026

Uh oh!

cubesandboxbot Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR #705 Review: Extract auto-pause coordinator into cube-lifecycle-manager

Strengths

Key Concerns

Minor

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cubesandboxbot Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

chengjoey Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

chengjoey Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cubesandboxbot Bot commented Jul 2, 2026 •

edited

Loading