Add configurable affinity_mode for egress pod selection #1209
abhisheksingh-R41 wants to merge 6 commits into
The current StartEgressAffinity always scores idle pods at 0.5 and busy
pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this
means the first pod to accept any job wins all subsequent jobs until its
CPU budget is exhausted — a staircase pattern rather than even spread.
The existing code comment acknowledges this is intentional for mixed
fleets ("avoids having many instances with one track request each, taking
availability from room composite"). However, for a TrackEgress-only fleet
the packing strategy provides no benefit and causes sequential scale-out
delays.
This commit adds a configurable affinity_mode field to ServiceConfig:
  pack (default): current behaviour, unchanged.
  spread: CPU-proportional scoring. Idle pods score 1.0 and win
    immediately via the MaximumAffinity short-circuit; busy pods score
    proportionally, so the least-loaded pod wins after
    ShortCircuitTimeout. Best for single-type (TrackEgress-only) fleets.
  type_aware: RoomComposite/Web requests prefer idle pods (1.0 idle /
    0.5 busy); Track/Participant requests spread by CPU load. Best of
    both worlds for mixed fleets.
Default is "pack" so all existing deployments are unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
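The three modes in the commit message above can be read as a small scoring table. The following is a minimal sketch of that table as a pure Go function, not the code from the patch: the names `AffinityMode`, `ModePack`, `ModeSpread`, `ModeTypeAware`, `spreadScore` and `affinityScore` are illustrative assumptions, and the real `StartEgressAffinity` takes a request and consults the monitor rather than receiving booleans.

```go
package main

import "fmt"

// AffinityMode mirrors the affinity_mode values described in the commit
// message. The type and constant names are assumptions for illustration.
type AffinityMode string

const (
	ModePack      AffinityMode = "pack"
	ModeSpread    AffinityMode = "spread"
	ModeTypeAware AffinityMode = "type_aware"
)

// spreadScore returns 1.0 for an idle pod and the available CPU fraction
// (assumed to be in [0, 1]) for a busy pod.
func spreadScore(idle bool, available float32) float32 {
	if idle {
		return 1.0
	}
	return available
}

// affinityScore is a hypothetical stand-in for the mode switch.
//
//	idle:      the pod currently has no active requests
//	heavy:     the request is RoomComposite or Web
//	available: fraction of the CPU budget still free
func affinityScore(mode AffinityMode, idle, heavy bool, available float32) float32 {
	switch mode {
	case ModeSpread:
		return spreadScore(idle, available)
	case ModeTypeAware:
		if !heavy {
			// Track/Participant requests spread by CPU load.
			return spreadScore(idle, available)
		}
		// RoomComposite/Web requests prefer idle pods.
		if idle {
			return 1.0
		}
		return 0.5
	default:
		// "pack" and any unset or unknown value: busy pods win.
		if idle {
			return 0.5
		}
		return 1.0
	}
}

func main() {
	for _, mode := range []AffinityMode{"", ModePack, ModeSpread, ModeTypeAware} {
		fmt.Printf("mode=%q idle=%.2f busy=%.2f\n", mode,
			affinityScore(mode, true, false, 1.0),
			affinityScore(mode, false, false, 0.4))
	}
}
```

Keeping the scoring free of side effects like this would make each mode easy to cover with table-driven tests, independent of the RPC plumbing.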
@frostbyte73 @milos-lk, would appreciate a review when you get a chance. This adds a configurable
Could you provide more information about the motivation for handling track/participant requests differently? The main motivation for the current scheme is related to autoscaling, particularly down scaling: since draining an instance can take a long time, we want to make sure that the instance most likely to get terminated on a down scale event is the one with the fewest requests (ideally 0). If we were to take a patch to adjust the behavior, more extensive unit tests would be needed to ensure no regression over time.
…e_aware modes
Addresses both root causes of the 24/51-job-on-one-pod skew observed in the 2026-05-07 load test:
Cause A (strict > tie-break): psrpc's ShortCircuitTimeout means the first replier wins when all idle pods return the same score. Fixed by subtracting rand.Float32()*0.001 jitter so idle pods produce distinct scores, making the strict-> comparison effectively random among equally-idle peers.
Cause B (m.requests.Inc lag): StartEgressAffinity is called before StartEgress, so the winning pod's m.requests counter stays 0 across an entire 200ms burst window and all callers see score 1.0. Fixed by a pendingClaims atomic.Int32 that increments at affinity time and decrements at StartEgress accept (consumePendingClaim). A 2s self-decay timer guards against claims that are never fulfilled. A CAS loop in consumePendingClaim ensures exactly one decrement fires per increment even when StartEgress and the timer race.
New monitor helper AvailableCPUFractionWithPending deducts pendingSlots*TrackCpuCost from the available budget so the score decreases with each in-flight claim.
Image: asia-south1-docker.pkg.dev/avian-pulsar-430509-f6/r41-livekit/egress:v1.12.0-r41.2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eout
In spread mode, an idle pod computed AvailableCPUFraction = 2.4/4.0 = 0.6. Since psrpc's MaximumAffinity is 1.0, the ShortCircuitTimeout (500ms fast path) never fired: every dispatch waited the full AffinityTimeout even when idle pods were ready.
An idle pod (activeRequests=0, pendingClaims<=1) now returns 1.0 plus tiny jitter, triggering the 500ms short-circuit. Busy pods still return a proportional fraction.
Also fix jitter direction bug in type_aware heavy-request path: 1.0 - jitter landed below MaximumAffinity; changed to 1.0 + jitter.
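The idle-pod rule from this commit can be illustrated with a short sketch. The function name `spreadScoreWithJitter` and its parameters are assumptions, and the jitter magnitude of 0.001 is borrowed from the earlier commit message rather than confirmed for this one. The jitter is passed in as an argument so the rule itself stays deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
)

// maxJitter is the assumed upper bound on the tie-breaking jitter.
const maxJitter = 0.001

// spreadScoreWithJitter returns 1.0 + jitter for a pod that counts as idle
// (no active requests and at most one pending claim), so the score is
// never below a MaximumAffinity of 1.0. Busy pods return their available
// CPU fraction unchanged.
func spreadScoreWithJitter(activeRequests, pendingClaims int, availableFraction, jitter float32) float32 {
	if activeRequests == 0 && pendingClaims <= 1 {
		return 1.0 + jitter
	}
	return availableFraction
}

func main() {
	jitter := rand.Float32() * maxJitter
	// Before the fix an idle pod reported 2.4/4.0 = 0.6, which is below 1.0.
	fmt.Println("idle pod:", spreadScoreWithJitter(0, 1, 0.6, jitter))
	fmt.Println("busy pod:", spreadScoreWithJitter(3, 0, 0.35, jitter))
}
```

Adding the jitter rather than subtracting it matters here: 1.0 - jitter would fall just under 1.0 and miss the maximum-affinity threshold, which is the direction bug the commit message mentions.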
When all egress pods are simultaneously over their CPU budget (Chrome cold-start storm), every pod returns -1 and the dispatcher gets zero bids. The job is permanently dropped.
New config fields soft_reject_floor (float, default 0) and max_active_requests (int, default 0) allow a pod to return a small positive score instead of -1 when CanAcceptRequest is false but the pod is below its design capacity. The dispatcher can then select this pod as a last resort instead of dropping the job. Guarded by MaxActiveRequests so genuinely full pods still hard-reject.
Also add unit tests for softRejectScore helper.
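One possible shape for the soft-reject helper is sketched below. The signature is a guess, since the commit message names `softRejectScore` but does not show it, and treating the zero defaults as "feature disabled" is likewise an assumption.

```go
package main

import "fmt"

// hardReject is the score the commit message says an over-budget pod
// returns today.
const hardReject float32 = -1

// softRejectScore is a hypothetical version of the helper. It is meant to
// be called only after CanAcceptRequest has returned false.
//
//	floor:     soft_reject_floor from config (0 assumed to disable)
//	maxActive: max_active_requests from config (0 assumed to disable)
//	active:    number of requests the pod is currently running
func softRejectScore(floor float32, maxActive, active int) float32 {
	if floor <= 0 || maxActive <= 0 {
		return hardReject // feature not configured
	}
	if active >= maxActive {
		return hardReject // genuinely full pod
	}
	return floor // last-resort bid
}

func main() {
	fmt.Println("defaults:      ", softRejectScore(0, 0, 0))
	fmt.Println("below capacity:", softRejectScore(0.01, 4, 3))
	fmt.Println("at capacity:   ", softRejectScore(0.01, 4, 4))
}
```

Keeping the floor well below any score a healthy pod would return means a soft-rejecting pod is only chosen when no pod bids normally.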
type_aware mode is not used in production (affinity_mode: spread in both config blocks). Reverts idle-1.0 and jitter-direction changes in that branch to keep the diff minimal and production-relevant only. build-push-ar.yml removed — images are built and pushed manually.
Each egress pod now writes lk:egress:pod:{POD_NAME} every 2s with
"active/max" load immediately after RegisterStartEgressTopic — the
exact moment the pod can accept requests. Key is deleted on shutdown
to give CapacityManager instant visibility rather than waiting for the
5s TTL. Replaces Kubernetes pod-count-based capacity estimation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
The current StartEgressAffinity scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this causes a staircase distribution: as soon as a pod accepts its first job it scores 1.0 and wins all subsequent jobs via the short-circuit, until its CPU budget is exhausted. The next pod then starts filling, and so on.
The existing source comment acknowledges this is intentional:
"avoids having many instances with one track request each, taking availability from room composite"
This is the right behaviour for mixed fleets (Track + RoomComposite). However, for a TrackEgress-only fleet the packing strategy provides no benefit and actively hurts KEDA/HPA scale-out: newly provisioned pods start idle at 0.5 and always lose to already-busy pods until those pods saturate.
Solution
Add a configurable affinity_mode field to ServiceConfig. The default (pack) preserves existing behaviour exactly, with zero change for current deployments.
Modes: pack (default), spread, type_aware.
How spread works
- Idle pods: AvailableCPUFraction() == 1.0 → hits MaximumAffinity=1 and wins immediately (same speed as current busy-pod short-circuit).
- Busy pods: score proportionally to available CPU; the client waits ShortCircuitTimeout=500ms and picks the least-loaded pod.
Changes
- pkg/config/service.go: adds AffinityMode string `yaml:"affinity_mode"` to ServiceConfig
- pkg/stats/monitor.go: adds AvailableCPUFraction() float32 (wraps existing getCPUUsageLocked; idle returns 1.0, busy returns available/total)
- pkg/server/server_rpc.go: replaces StartEgressAffinity with mode switch; adds isHeavyEgressRequest helper
- pkg/server/server_rpc_test.go: table-driven unit tests for isHeavyEgressRequest covering all 5 request types
Backwards compatibility
affinity_mode defaults to the empty string, which falls through to the default: case and is identical to current pack behaviour.
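The table-driven test for isHeavyEgressRequest listed under Changes is not shown in the description. A rough sketch of that style follows, using a hypothetical `requestKind` enum in place of the real protobuf request types. Classifying RoomComposite and Web as heavy follows the commit message; placing TrackComposite with the light requests is an assumption.

```go
package main

import "fmt"

// requestKind is a hypothetical enum standing in for the five egress
// request types.
type requestKind int

const (
	kindRoomComposite requestKind = iota
	kindWeb
	kindParticipant
	kindTrackComposite
	kindTrack
)

// isHeavy reports whether a request should prefer an idle pod.
func isHeavy(kind requestKind) bool {
	return kind == kindRoomComposite || kind == kindWeb
}

// heavyCases is the table: one row per request type.
var heavyCases = []struct {
	name string
	kind requestKind
	want bool
}{
	{"room composite", kindRoomComposite, true},
	{"web", kindWeb, true},
	{"participant", kindParticipant, false},
	{"track composite", kindTrackComposite, false}, // assumed light
	{"track", kindTrack, false},
}

func main() {
	for _, tc := range heavyCases {
		got := isHeavy(tc.kind)
		status := "ok"
		if got != tc.want {
			status = "MISMATCH"
		}
		fmt.Printf("%-16s heavy=%-5v %s\n", tc.name, got, status)
	}
}
```

In a real `_test.go` file the loop body would typically be wrapped in `t.Run(tc.name, ...)` so each row reports as its own subtest.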