fix: decouple connection retry backoff from TCP dial timeout #1387

linkvt · 2025-12-04T09:03:00Z

KEDA_HTTP_CONNECT_TIMEOUT was incorrectly used as both the TCP dial timeout and the initial retry backoff duration. During a cold start the interceptor's first connection attempt usually fails, so every cold request waited at least one backoff interval, and cold start response times scaled linearly with the timeout value.

Changes

Adopt Knative's proven backoff strategy, which is more aggressive and doesn't depend on the connection timeout
Fix MinTotalBackoffDuration to calculate the exponential (not linear) sum of the backoff durations
Improve dial retry test coverage

Before and After

I generated a small script to simulate the sleep times before and after the change; the new schedule is far more reasonable.
Once the backoff reaches its cap, we now retry every second. Previously the retry interval grew to 16 seconds after 5 failed connection attempts with the default timeout, and to 10m40s with a timeout of 20 seconds.

Before with KEDA_HTTP_CONNECT_TIMEOUT=500ms (default)

Step	Sleep Duration	Cumulative
1	500ms	500ms
2	1s	1.5s
3	2s	3.5s
4	4s	7.5s
5	8s	15.5s
6	16s	31.5s
7	16s	47.5s

After (independent of KEDA_HTTP_CONNECT_TIMEOUT)

Step	Sleep Duration	Cumulative
1	50ms	50ms
2	70ms	120ms
3	98ms	218ms
4	137.2ms	355.2ms
5	192.08ms	547.28ms
6	268.912ms	816.192ms
7	376.4768ms	1.1926688s
8	527.067519ms	1.719736319s
9	737.894526ms	2.457630845s
10	1s	3.457630845s
11	1s	4.457630845s
12	1s	5.457630845s

More Results

Before with KEDA_HTTP_CONNECT_TIMEOUT=5s

Step	Sleep Duration	Cumulative
1	5s	5s
2	10s	15s
3	20s	35s
4	40s	1m15s
5	1m20s	2m35s
6	2m40s	5m15s
7	2m40s	7m55s

Before with KEDA_HTTP_CONNECT_TIMEOUT=20s

Step	Sleep Duration	Cumulative
1	20s	20s
2	40s	1m0s
3	1m20s	2m20s
4	2m40s	5m0s
5	5m20s	10m20s
6	10m40s	21m0s
7	10m40s	31m40s
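
The three "Before" tables follow a single closed form, which makes the linear dependence on the timeout explicit. Here $T$ stands for KEDA_HTTP_CONNECT_TIMEOUT, $s_n$ for the sleep at step $n$, and $C_5$ for the cumulative sleep after five steps; these symbols are introduced only for this illustration, with the factor of 2 and the 5 steps taken from the script in the Code section.

```latex
s_n =
\begin{cases}
  2^{\,n-1}\,T & 1 \le n \le 5 \\
  2^{5}\,T = 32\,T & n \ge 6
\end{cases}
\qquad
C_5 = \sum_{n=1}^{5} 2^{\,n-1}\,T = (2^{5} - 1)\,T = 31\,T
```

For $T = 500\text{ms}$ this gives a plateau of 16s and $C_5 = 15.5\text{s}$; for $T = 20\text{s}$ it gives 10m40s and 10m20s, matching the tables.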

Code

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func printBackoff(name string, backoff wait.Backoff) {
	fmt.Printf("#### %s\n\n", name)
	fmt.Println("| Step | Sleep Duration | Cumulative |")
	fmt.Println("|------|----------------|------------|")

	steps := backoff.Steps + 2
	var cumulative time.Duration
	for i := 1; i <= steps; i++ {
		sleep := backoff.Step()
		cumulative += sleep
		fmt.Printf("| %d | %v | %v |\n", i, sleep, cumulative)
	}
	fmt.Println()
}

func main() {
	// Before: backoff duration was tied to KEDA_HTTP_CONNECT_TIMEOUT
	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=500ms (default)", wait.Backoff{
		Duration: 500 * time.Millisecond,
		Factor:   2,
		Jitter:   0, // disable jitter for deterministic output
		Steps:    5,
	})

	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=5s", wait.Backoff{
		Duration: 5 * time.Second,
		Factor:   2,
		Jitter:   0,
		Steps:    5,
	})

	printBackoff("Before with KEDA_HTTP_CONNECT_TIMEOUT=20s", wait.Backoff{
		Duration: 20 * time.Second,
		Factor:   2,
		Jitter:   0,
		Steps:    5,
	})

	// After: fixed backoff independent of KEDA_HTTP_CONNECT_TIMEOUT
	printBackoff("After (independent of KEDA_HTTP_CONNECT_TIMEOUT)", wait.Backoff{
		Cap:      1 * time.Second, // 1s cap, matching the plateau in the "After" table
		Duration: 50 * time.Millisecond,
		Factor:   1.4,
		Jitter:   0, // disable jitter for deterministic output
		Steps:    10,
	})
}

Checklist

Commits are signed with Developer Certificate of Origin (DCO)
Changelog has been updated and is aligned with our changelog requirements
Any necessary documentation is added, such as:

Fixes #1385

snyk-io · 2025-12-04T09:03:10Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scanner	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues


KEDA_HTTP_CONNECT_TIMEOUT was incorrectly used as both the TCP dial timeout and the initial retry backoff duration. This caused cold start response times to scale linearly with the timeout value (e.g., 500ms timeout → 3.5s response, 5s timeout → 7.8s response).

Changes:
- Use fixed 50ms initial backoff duration for cold start polling
- Adopt Knative's proven backoff strategy (factor 1.4, jitter 0.1)
- Fix MinTotalBackoffDuration to calculate exponential (not linear) sum
- Improve dial retry test coverage

Signed-off-by: Vincent Link <vlink@redhat.com>

linkvt changed the title ~~fix: decouple connection retry backoff from TCP dial timeoutdG~~ fix: decouple connection retry backoff from TCP dial timeout Dec 4, 2025

linkvt force-pushed the decouple-connection-probing-from-timeout branch 2 times, most recently from ddc5de4 to 95e84db on December 9, 2025 08:35

linkvt force-pushed the decouple-connection-probing-from-timeout branch from 95e84db to ce8924f on December 9, 2025 08:35
