Add configurable affinity_mode for egress pod selection by abhisheksingh-R41 · Pull Request #1209 · livekit/egress

abhisheksingh-R41 · 2026-05-06T10:29:04Z

Problem

The current StartEgressAffinity scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this causes a staircase distribution: as soon as a pod accepts its first job it scores 1.0 and wins all subsequent jobs via the short-circuit, until its CPU budget is exhausted. The next pod then starts filling, and so on.

The existing source comment acknowledges this is intentional:

"if this instance is idle and another is already handling some, the request will go to that server. This avoids having many instances with one track request each, taking availability from room composite."

This is the right behaviour for mixed fleets (Track + RoomComposite). However for a TrackEgress-only fleet the packing strategy provides no benefit and actively hurts KEDA/HPA scale-out — newly provisioned pods start idle at 0.5 and always lose to already-busy pods until those pods saturate.
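The staircase effect described above can be reproduced with a small simulation. This is only a sketch: `packAffinity`, `pickPod`, and `simulate` are hypothetical names, the per-pod capacity stands in for the CPU budget, and ties between equal scores go to the lowest index as a stand-in for "first reply wins".

```go
package main

import "fmt"

// packAffinity mirrors the scoring described in the PR text:
// idle pods bid 0.5, pods with at least one job bid 1.0.
func packAffinity(activeJobs int) float32 {
	if activeJobs == 0 {
		return 0.5
	}
	return 1.0
}

// pickPod returns the index of the highest bidder among pods that still
// have capacity, or -1 if every pod is full. Ties go to the lowest index
// (an assumption standing in for "first reply wins").
func pickPod(jobs []int, capacity int) int {
	best, bestScore := -1, float32(-1)
	for i, n := range jobs {
		if n >= capacity {
			continue // a full pod rejects the request
		}
		if s := packAffinity(n); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

// simulate dispatches n jobs one at a time and returns per-pod job counts.
func simulate(pods, capacity, n int) []int {
	jobs := make([]int, pods)
	for j := 0; j < n; j++ {
		if i := pickPod(jobs, capacity); i >= 0 {
			jobs[i]++
		}
	}
	return jobs
}

func main() {
	// 3 pods, 4 jobs of capacity each, 6 jobs dispatched.
	fmt.Println(simulate(3, 4, 6)) // prints [4 2 0]
}
```

The first pod fills completely before the second receives anything, and the third stays idle, which is the pattern a freshly scaled-out pod would see.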

Solution

Add a configurable affinity_mode field to ServiceConfig. The default (pack) preserves existing behaviour exactly — zero change for current deployments.

| Mode | Affinity scoring | Best for |
| --- | --- | --- |
| `pack` (default) | idle=0.5, busy=1.0 | Mixed fleet (Track + RoomComposite) |
| `spread` | CPU-proportional; idle=1.0 → wins immediately | Single-type fleet (TrackEgress only) |
| `type_aware` | RoomComposite/Web prefer idle pods; Track/Participant spread by CPU load | Mixed fleet with smarter routing |
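A minimal sketch of how the three modes could map to scores, assuming a simplified `podState` struct and an `affinity` function that are illustrative only and not the PR's actual code. The `heavy` flag stands in for the result of `isHeavyEgressRequest`, and the 1.0/0.5 values for heavy requests come from the commit message.

```go
package main

import "fmt"

// podState is a hypothetical, simplified view of one egress pod.
type podState struct {
	activeRequests int
	availableCPU   float32 // remaining CPU as a fraction in [0, 1]
}

// spreadScore gives idle pods the maximum score and busy pods
// their remaining CPU fraction.
func spreadScore(p podState) float32 {
	if p.activeRequests == 0 {
		return 1.0
	}
	return p.availableCPU
}

// affinity returns the bid for a request under the given mode.
// heavy is true for RoomComposite/Web style requests.
func affinity(mode string, heavy bool, p podState) float32 {
	idle := p.activeRequests == 0
	switch mode {
	case "spread":
		return spreadScore(p)
	case "type_aware":
		if heavy {
			// Heavy requests prefer idle pods.
			if idle {
				return 1.0
			}
			return 0.5
		}
		// Track/Participant requests spread by CPU load.
		return spreadScore(p)
	default:
		// "pack" and the empty string both land here.
		if idle {
			return 0.5
		}
		return 1.0
	}
}

func main() {
	idle := podState{activeRequests: 0, availableCPU: 1.0}
	busy := podState{activeRequests: 2, availableCPU: 0.6}
	for _, mode := range []string{"pack", "spread", "type_aware"} {
		fmt.Println(mode, affinity(mode, false, idle), affinity(mode, false, busy))
	}
}
```

Keeping `pack` in the `default` branch means an unset or unknown mode behaves like the existing scoring.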

How `spread` works

Idle pods return AvailableCPUFraction() == 1.0 → hits MaximumAffinity=1 and wins immediately (same speed as current busy-pod short-circuit).
Busy pods return their remaining CPU fraction (e.g. 0.6) → client waits ShortCircuitTimeout=500ms and picks the least-loaded pod.
Result: jobs distribute evenly across all pods rather than packing sequentially.
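The client-side selection that `spread` relies on can be modelled roughly as follows. This is a simplified model, not psrpc's actual implementation: the `bid` type and `selectPod` function are hypothetical, bids are assumed to be sorted by arrival time, and only the two constants named in the PR text (MaximumAffinity=1, ShortCircuitTimeout=500ms) are used.

```go
package main

import "fmt"

// bid is one pod's reply to an affinity query (hypothetical type).
type bid struct {
	pod       string
	score     float32
	arrivalMs int // milliseconds after the query was sent
}

const (
	maximumAffinity       = 1.0 // value named in the PR text
	shortCircuitTimeoutMs = 500 // value named in the PR text
)

// selectPod walks bids in arrival order. A bid at or above
// maximumAffinity wins immediately; otherwise the best bid that
// arrived within the short-circuit window wins.
func selectPod(bids []bid) string {
	best, bestScore := "", float32(0)
	for _, b := range bids {
		if b.arrivalMs > shortCircuitTimeoutMs {
			break // too late to be considered in this model
		}
		if b.score >= maximumAffinity {
			return b.pod
		}
		if b.score > bestScore {
			best, bestScore = b.pod, b.score
		}
	}
	return best
}

func main() {
	// All pods busy: the least-loaded pod (highest remaining CPU) wins.
	fmt.Println(selectPod([]bid{
		{"pod-a", 0.3, 10}, {"pod-b", 0.6, 20}, {"pod-c", 0.4, 30},
	})) // prints pod-b

	// One idle pod: it wins as soon as its bid arrives.
	fmt.Println(selectPod([]bid{
		{"pod-a", 0.3, 10}, {"pod-idle", 1.0, 15}, {"pod-b", 0.6, 20},
	})) // prints pod-idle
}
```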

Changes

pkg/config/service.go: adds AffinityMode string `yaml:"affinity_mode"` to ServiceConfig
pkg/stats/monitor.go: adds AvailableCPUFraction() float32 (wraps existing getCPUUsageLocked; idle returns 1.0, busy returns available/total)
pkg/server/server_rpc.go: replaces StartEgressAffinity with mode switch; adds isHeavyEgressRequest helper
pkg/server/server_rpc_test.go: table-driven unit tests for isHeavyEgressRequest covering all 5 request types

Backwards compatibility

affinity_mode defaults to the empty string, which falls through to the default: case and behaves identically to the current pack mode.
No changes to existing config parsing, prometheus metrics, or admission logic.

The current StartEgressAffinity always scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this means the first pod to accept any job wins all subsequent jobs until its CPU budget is exhausted: a staircase pattern rather than an even spread. The existing code comment acknowledges this is intentional for mixed fleets ("avoids having many instances with one track request each, taking availability from room composite"). However, for a TrackEgress-only fleet the packing strategy provides no benefit and causes sequential scale-out delays.

This commit adds a configurable affinity_mode field to ServiceConfig:

- pack (default): current behaviour, unchanged.
- spread: CPU-proportional scoring; idle pods score 1.0 and win immediately via the MaximumAffinity short-circuit, busy pods score proportionally so the least-loaded pod wins after ShortCircuitTimeout. Best for single-type (TrackEgress-only) fleets.
- type_aware: RoomComposite/Web requests prefer idle pods (1.0 idle / 0.5 busy); Track/Participant requests spread by CPU load. Best of both worlds for mixed fleets.

Default is "pack" so all existing deployments are unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

abhisheksingh-R41 · 2026-05-06T10:51:51Z

@frostbyte73 @milos-lk — would appreciate a review when you get a chance. This adds a configurable affinity_mode to address staircase distribution in single-type (TrackEgress-only) fleets. Default is pack so existing deployments are unaffected.

biglittlebigben · 2026-05-06T18:56:21Z

Could you provide more information about the motivation for handling track/participant requests differently? The main motivation for the current scheme is related to autoscaling, particularly downscaling: since draining an instance can take a long time, we want to make sure that the instance most likely to get terminated on a downscale event is the one with the fewest requests (ideally 0).

If we were to take a patch to adjust the behavior, more extensive unit tests would be needed to ensure no regression over time.

…e_aware modes

Addresses both root causes of the 24/51-job-on-one-pod skew observed in the 2026-05-07 load test:

Cause A (strict > tie-break): psrpc's ShortCircuitTimeout means the first replier wins when all idle pods return the same score. Fixed by subtracting rand.Float32()*0.001 jitter so idle pods produce distinct scores, making the strict-> comparison effectively random among equally-idle peers.

Cause B (m.requests.Inc lag): StartEgressAffinity is called before StartEgress, so the winning pod's m.requests counter stays 0 across an entire 200ms burst window and all callers see score 1.0. Fixed by a pendingClaims atomic.Int32 that increments at affinity time and decrements at StartEgress accept (consumePendingClaim). A 2s self-decay timer guards against claims that are never fulfilled. A CAS loop in consumePendingClaim ensures exactly one decrement fires per increment even when StartEgress and the timer race.

New monitor helper AvailableCPUFractionWithPending deducts pendingSlots*TrackCpuCost from the available budget so the score decreases with each in-flight claim.

Image: asia-south1-docker.pkg.dev/avian-pulsar-430509-f6/r41-livekit/egress:v1.12.0-r41.2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
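The pending-claim mechanism from Cause B can be sketched as below. This is an illustration under assumptions rather than the commit's code: `claimTracker`, `claim`, `consume`, and `availableWithPending` are hypothetical names, and the CAS loop here simply floors the counter at zero, which is one way to keep a racing timer and accept path from driving it negative.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// claimDecay is the self-decay interval named in the commit message.
const claimDecay = 2 * time.Second

// claimTracker counts affinity bids that have not yet become jobs.
type claimTracker struct {
	pending atomic.Int32
}

// claim records one in-flight bid and schedules its self-decay, so a
// claim that is never followed by StartEgress is eventually released.
func (c *claimTracker) claim() {
	c.pending.Add(1)
	time.AfterFunc(claimDecay, func() { c.consume() })
}

// consume releases one claim using a compare-and-swap loop. It reports
// false when there is nothing left to release, so the counter never
// goes below zero even if the timer and the accept path race.
func (c *claimTracker) consume() bool {
	for {
		v := c.pending.Load()
		if v <= 0 {
			return false
		}
		if c.pending.CompareAndSwap(v, v-1) {
			return true
		}
	}
}

// availableWithPending deducts pending*trackCost from the available CPU
// before computing the fraction, clamping at zero.
func availableWithPending(availableCPU, totalCPU, trackCost float64, pending int32) float64 {
	free := availableCPU - float64(pending)*trackCost
	if free < 0 {
		free = 0
	}
	return free / totalCPU
}

func main() {
	var c claimTracker
	c.claim()
	c.claim()
	fmt.Println("pending:", c.pending.Load())
	// trackCost of 0.5 is an assumed value for illustration.
	fmt.Println("score:", availableWithPending(2.4, 4.0, 0.5, c.pending.Load()))
	c.consume() // StartEgress accepted one of the claims
	fmt.Println("pending:", c.pending.Load())
}
```

Each additional in-flight claim lowers the score the pod advertises, so a burst of callers no longer all see the same idle pod.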

…eout

In spread mode, an idle pod computed AvailableCPUFraction = 2.4/4.0 = 0.6. Since psrpc's MaximumAffinity is 1.0, the ShortCircuitTimeout (500ms fast path) never fired, and every dispatch waited the full AffinityTimeout even when idle pods were ready.

An idle pod (activeRequests=0, pendingClaims<=1) now returns 1.0 plus tiny jitter, triggering the 500ms short-circuit. Busy pods still return a proportional fraction.

Also fix jitter direction bug in type_aware heavy-request path: 1.0 - jitter landed below MaximumAffinity; changed to 1.0 + jitter.
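A rough sketch of the idle fast path described in this commit, assuming a hypothetical `spreadScore` function. The idle condition and the 0.001 jitter scale are taken from the commit messages; everything else is illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// spreadScore returns at least 1.0 for an idle pod so the score reaches
// MaximumAffinity, and the plain CPU fraction for a busy pod.
func spreadScore(activeRequests, pendingClaims int, cpuFraction float32) float32 {
	if activeRequests == 0 && pendingClaims <= 1 {
		// Jitter is added, not subtracted, so the score never falls
		// below 1.0 while still differing between idle peers.
		return 1.0 + rand.Float32()*0.001
	}
	return cpuFraction
}

func main() {
	// Idle pod whose raw CPU fraction is 0.6 (the 2.4/4.0 case above).
	fmt.Println(spreadScore(0, 1, 0.6) >= 1.0) // prints true
	// Busy pod keeps its proportional score.
	fmt.Println(spreadScore(2, 0, 0.6)) // prints 0.6
}
```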

When all egress pods are simultaneously over their CPU budget (Chrome cold-start storm), every pod returns -1 and the dispatcher gets zero bids. The job is permanently dropped. New config fields soft_reject_floor (float, default 0) and max_active_requests (int, default 0) allow a pod to return a small positive score instead of -1 when CanAcceptRequest is false but the pod is below its design capacity. The dispatcher can then select this pod as a last resort instead of dropping the job. Guarded by MaxActiveRequests so genuinely full pods still hard-reject. Also add unit tests for softRejectScore helper.
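The soft-reject decision might look like the sketch below. The signature of `softRejectScore` is an assumption (the commit only names the helper), and the parameters stand in for the soft_reject_floor and max_active_requests config fields plus the pod's current active request count.

```go
package main

import "fmt"

// softRejectScore is called when CanAcceptRequest is false. It returns
// the configured floor as a last-resort bid when the pod is below its
// design capacity, and -1 (hard reject) otherwise.
func softRejectScore(floor float32, maxActive, active int) float32 {
	if floor <= 0 || maxActive <= 0 {
		return -1 // both fields default to 0, which disables soft reject
	}
	if active >= maxActive {
		return -1 // genuinely full pods still hard-reject
	}
	return floor
}

func main() {
	fmt.Println(softRejectScore(0, 0, 1))    // prints -1
	fmt.Println(softRejectScore(0.01, 4, 2)) // prints 0.01
	fmt.Println(softRejectScore(0.01, 4, 4)) // prints -1
}
```

Because the floor is small and positive, any pod that can accept normally still outbids a soft-rejecting pod.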

type_aware mode is not used in production (affinity_mode: spread in both config blocks). Reverts idle-1.0 and jitter-direction changes in that branch to keep the diff minimal and production-relevant only. build-push-ar.yml removed — images are built and pushed manually.

Each egress pod now writes lk:egress:pod:{POD_NAME} every 2s with "active/max" load immediately after RegisterStartEgressTopic — the exact moment the pod can accept requests. Key is deleted on shutdown to give CapacityManager instant visibility rather than waiting for the 5s TTL. Replaces Kubernetes pod-count-based capacity estimation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

abhisheksingh-R41 requested a review from a team as a code owner May 6, 2026 10:29

abhisheksingh-R41 and others added 5 commits May 7, 2026 15:46

abhisheksingh-R41 wants to merge 6 commits into livekit:main from recruit41:upstream-affinity-mode

2 participants
