@linkvt linkvt commented Dec 4, 2025

KEDA_HTTP_CONNECT_TIMEOUT was incorrectly used as both the TCP dial timeout and the initial retry backoff duration. Because the interceptor's first connection attempt usually fails during a cold start, at least one backoff sleep was almost always incurred, so cold start response times scaled linearly with the timeout value.

Changes

  • Adopt Knative's proven backoff strategy, which is more aggressive and doesn't depend on the connection timeout
  • Fix MinTotalBackoffDuration to calculate the sum correctly
  • Improve dial retry test coverage

Before and After

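I wrote a small script to simulate the sleep times before and after the change; the new schedule is far more reasonable.
With the default 500ms timeout, the old backoff plateaued at 16 seconds between retries after five failed attempts; with a 20-second timeout, each sleep exceeded one minute from the third retry onward. The new backoff plateaus at one second between retries, regardless of the timeout.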

Before with KEDA_HTTP_CONNECT_TIMEOUT=500ms (default)

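| Step | Sleep Duration | Cumulative |
|------|----------------|------------|
| 1 | 500ms | 500ms |
| 2 | 1s | 1.5s |
| 3 | 2s | 3.5s |
| 4 | 4s | 7.5s |
| 5 | 8s | 15.5s |
| 6 | 16s | 31.5s |
| 7 | 16s | 47.5s |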

After (independent of KEDA_HTTP_CONNECT_TIMEOUT)

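| Step | Sleep Duration | Cumulative |
|------|----------------|------------|
| 1 | 50ms | 50ms |
| 2 | 70ms | 120ms |
| 3 | 98ms | 218ms |
| 4 | 137.2ms | 355.2ms |
| 5 | 192.08ms | 547.28ms |
| 6 | 268.912ms | 816.192ms |
| 7 | 376.4768ms | 1.1926688s |
| 8 | 527.067519ms | 1.719736319s |
| 9 | 737.894526ms | 2.457630845s |
| 10 | 1s | 3.457630845s |
| 11 | 1s | 4.457630845s |
| 12 | 1s | 5.457630845s |

More Results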

Before with KEDA_HTTP_CONNECT_TIMEOUT=5s

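| Step | Sleep Duration | Cumulative |
|------|----------------|------------|
| 1 | 5s | 5s |
| 2 | 10s | 15s |
| 3 | 20s | 35s |
| 4 | 40s | 1m15s |
| 5 | 1m20s | 2m35s |
| 6 | 2m40s | 5m15s |
| 7 | 2m40s | 7m55s |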

Before with KEDA_HTTP_CONNECT_TIMEOUT=20s

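| Step | Sleep Duration | Cumulative |
|------|----------------|------------|
| 1 | 20s | 20s |
| 2 | 40s | 1m0s |
| 3 | 1m20s | 2m20s |
| 4 | 2m40s | 5m0s |
| 5 | 5m20s | 10m20s |
| 6 | 10m40s | 21m0s |
| 7 | 10m40s | 31m40s |

Code
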
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

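// printBackoff prints the sleep schedule of a backoff as a Markdown table.
// The Backoff is passed by value, so stepping it here does not affect the caller.
func printBackoff(name string, backoff wait.Backoff) {
	fmt.Printf("#### %s\n\n", name)
	fmt.Println("| Step | Sleep Duration | Cumulative |")
	fmt.Println("|------|----------------|------------|")

	// Print two rows beyond Steps to show the plateau once the steps are exhausted.
	steps := backoff.Steps + 2
	var cumulative time.Duration
	for i := 1; i <= steps; i++ {
		// Step returns the current sleep and advances the backoff state.
		sleep := backoff.Step()
		cumulative += sleep
		fmt.Printf("| %d | %v | %v |\n", i, sleep, cumulative)
	}
	fmt.Println()
}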

func main() {
	// Before: backoff duration was tied to KEDA_HTTP_CONNECT_TIMEOUT
	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=500ms (default)", wait.Backoff{
		Duration: 500 * time.Millisecond,
		Factor:   2,
		Jitter:   0, // disable jitter for deterministic output
		Steps:    5,
	})

	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=5s", wait.Backoff{
		Duration: 5 * time.Second,
		Factor:   2,
		Jitter:   0,
		Steps:    5,
	})

	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=20s", wait.Backoff{
		Duration: 20 * time.Second,
		Factor:   2,
		Jitter:   0,
		Steps:    5,
	})

	// After: fixed backoff independent of KEDA_HTTP_CONNECT_TIMEOUT
	printBackoff("After (independent of KEDA_HTTP_CONNECT_TIMEOUT)", wait.Backoff{
		Cap:      1 * time.Second, // matches the 1s plateau in the "After" table above
		Duration: 50 * time.Millisecond,
		Factor:   1.4,
		Jitter:   0, // disable jitter for deterministic output
		Steps:    10,
	})
}

Checklist

Fixes #1385

snyk-io bot commented Dec 4, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total (0) |
|--------|---------|----------|------|--------|-----|-----------|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |


@linkvt linkvt changed the title from "fix: decouple connection retry backoff from TCP dial timeoutdG" to "fix: decouple connection retry backoff from TCP dial timeout" Dec 4, 2025
@linkvt linkvt force-pushed the decouple-connection-probing-from-timeout branch 2 times, most recently from ddc5de4 to 95e84db Compare December 9, 2025 08:35
KEDA_HTTP_CONNECT_TIMEOUT was incorrectly used as both the TCP dial
timeout and the initial retry backoff duration. This caused cold start
response times to scale linearly with the timeout value (e.g., 500ms
timeout → 3.5s response, 5s timeout → 7.8s response).

Changes:
- Use fixed 50ms initial backoff duration for cold start polling
- Adopt Knative's proven backoff strategy (factor 1.4, jitter 0.1)
- Fix MinTotalBackoffDuration to calculate exponential (not linear) sum
- Improve dial retry test coverage

Signed-off-by: Vincent Link <vlink@redhat.com>
@linkvt linkvt force-pushed the decouple-connection-probing-from-timeout branch from 95e84db to ce8924f Compare December 9, 2025 08:35

Development

Successfully merging this pull request may close these issues.

High KEDA_HTTP_CONNECT_TIMEOUT causes slow cold starts
