fix(e2e): improve infra setup reliability with retries and tolerant GC #8488
r2k1 wants to merge 1 commit into
Conversation
Pull request overview
This PR aims to improve E2E infrastructure setup reliability by adding retry/timeout patterns around Azure ARM operations and making cluster garbage-collection more tolerant so transient failures don’t cascade across all tests sharing a cluster.
Changes:
- Add retry with per-call timeouts for Bastion subnet GET and for AKS subnet + route table readiness checks.
- Make stale Kubernetes node GC tolerant to `NotFound` and non-fatal per-node delete errors.
- Adjust VMSS GC bookkeeping so VMSS slated for deletion don't shield their nodes from cleanup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| e2e/cluster.go | Adds ARM GET retries/timeouts for Bastion subnet, changes VMSS “keep” set semantics, and makes stale node GC more tolerant. |
| e2e/aks_model.go | Adds retries/timeouts around AKS subnet/route table discovery and retries Azure Firewall creation. |
```go
		return false, nil
	})
	if err != nil {
		return nil, fmt.Errorf("get subnet %q in vnet %q rg %q: %w", bastionSubnetName, vnet.name, nodeRG, subnetGetErr)
```

```go
			continue
		}
		toolkit.Logf(ctx, "deleted vmss %q (age: %v)", *vmss.ID, time.Since(*vmss.Properties.TimeCreated))
		// Don't add to existingVMSS — nodes from this VMSS should be cleaned up
```
```go
	// Retry the AKS subnet GET and route table lookup to tolerate:
	// - transient ARM timeouts on the subnet GET
	// - eventual consistency: kubenet route table may still be propagating after cluster create/reuse
	var aksRTName string
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		aksSubnetResp, subnetErr := config.Azure.Subnet.Get(callCtx, rg, vnet.name, "aks-subnet", nil)
		if subnetErr != nil {
			toolkit.Logf(ctx, "transient error getting AKS subnet (retrying): %v", subnetErr)
			return false, nil
		}
		rtName, rtErr := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
		if rtErr != nil {
			toolkit.Logf(ctx, "route table not ready (retrying): %v", rtErr)
			return false, nil
		}
		aksRTName = rtName
		return true, nil
	})
```
```go
	var fwResp armnetwork.AzureFirewallsClientCreateOrUpdateResponse
	err = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
		fwPoller, fwErr := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "transient error starting firewall creation (retrying): %v", fwErr)
			return false, nil
		}
		fwResp, fwErr = fwPoller.PollUntilDone(ctx, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "firewall creation did not complete (retrying): %v", fwErr)
			return false, nil
		}
		return true, nil
	})
```
timmy-wright
left a comment
LGTM - it'd be good to know which Copilot comments you've responded to. Currently they are all unresolved, but it looks like you took some of them into account.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
e2e/aks_model.go:455
- The firewall creation retry loop retries on any error from `BeginCreateOrUpdate` or `PollUntilDone`, including potentially permanent configuration/authorization errors. This can mask the real failure and waste up to 10 minutes before surfacing it. Consider classifying errors (e.g., `azcore.ResponseError` with a 4xx status other than 408/429) as non-retryable and returning immediately, retrying only on timeouts/429/5xx.
```go
	}
	if firewallPrivateIP == "" {
		return fmt.Errorf("failed to get firewall private IP address")
	}
	// Add firewall routes to the existing AKS route table using individual
	// route operations. This avoids replacing the entire table (which would
	// race with cloud-provider-azure pod route updates) and preserves the
	// subnet association so pod CIDR routes remain active.
	firewallRoutes := []armnetwork.Route{
		{
			Name: to.Ptr("vnet-local"),
			Properties: &armnetwork.RoutePropertiesFormat{
				AddressPrefix: to.Ptr("10.224.0.0/16"),
				NextHopType:   to.Ptr(armnetwork.RouteNextHopTypeVnetLocal),
```
```diff
 		if err := kube.Typed.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
-			deleteErrors = append(deleteErrors, fmt.Errorf("deleting stale node %q: %w", node.Name, err))
+			if apierrors.IsNotFound(err) {
+				toolkit.Logf(ctx, "stale K8s node %q already gone", node.Name)
+				continue
+			}
+			toolkit.Logf(ctx, "warning: failed to delete stale K8s node %q: %v", node.Name, err)
+			failed++
 			continue
 		}
 		toolkit.Logf(ctx, "deleted stale K8s node %q (VMSS %q not found)", node.Name, vmssName)
 		deleted++
 	}

-	if len(deleteErrors) > 0 {
-		return fmt.Errorf("failed to delete %d stale nodes, first error: %w", len(deleteErrors), deleteErrors[0])
+	if failed > 0 && deleted == 0 {
+		return fmt.Errorf("failed to delete any of %d stale nodes", failed)
 	}
```
It's ok, we only analyse logs.
I use the ADO UI to debug; why would we want to force users to fetch logs from artifacts?
```go
	if errors.As(subnetGetErr, &subnetAzErr) && subnetAzErr.StatusCode == http.StatusNotFound {
		return true, nil // 404 is expected — will create below
```
I don't want to deal with it. If we can't configure this, there is no point in saving budget. We will fail.
Address infrastructure failures that cause cascading test failures
(464 failures across 46 builds over 3 weeks).
1. Make stale node GC tolerant (cluster.go):
- Ignore NotFound errors on node deletion (already gone)
- Log other delete errors as warnings instead of failing
- Only return error if zero nodes could be deleted
Evidence: 79 failures from 'failed to delete 3 stale nodes'
2. Fix existingVMSS map (cluster.go):
- Only add VMSS to existingVMSS if they are being kept
- VMSS queued for deletion are excluded, so their stale K8s
nodes can be cleaned up in the same pass
- If VMSS deletion fails, keep in map to avoid orphaned deletes
3. Retry Bastion subnet GET (cluster.go):
- Poll with backoff for up to 30s on transient ARM errors
- 404 still handled normally (create subnet)
Evidence: 179 failures from 'get subnet AzureBastionSubnet:
context deadline exceeded'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Infrastructure setup failures cause cascading test failures — one transient Azure error fails ALL tests sharing that cluster. This PR adds retries, tolerant cleanup, and handles cluster lifecycle states properly.
Evidence: 464 failures across 46 builds over 3 weeks.
Changes
1. Tolerant stale node GC (`cluster.go`)
Previously any single failed K8s node delete killed cluster prep for all tests. The actual error was `nodes "..." not found` — the node was already gone.
Now: `NotFound` ignored, other errors logged as warnings; only fails if zero progress.
Evidence: 79 failures from "failed to delete 3 stale nodes, first error: ... not found"
2. Fix existingVMSS map (`cluster.go`)
VMSS queued for deletion were kept in the "existing" map, protecting their orphaned K8s nodes from cleanup. Now only kept VMSS go in the map.
3. Retry Bastion subnet GET with per-call timeout (`cluster.go`)
A single unresponsive `Subnet.Get` consumed the entire 20-minute cluster prep budget. Logs confirmed: "preparing cluster done (1200.0s)".
Now: each GET has a 30-second timeout, retried every 5s for up to 2 minutes.
Evidence: 179 failures across 3 builds from "get subnet AzureBastionSubnet: context deadline exceeded"
4. Handle Deleting/Canceled cluster states (`cluster.go`)
`waitUntilClusterReady` treated `Deleting` as an unknown state and failed immediately. A cluster stuck in `Deleting` from a previous run killed all tests.
Now: waits for `Deleting`/`Canceled`/`Canceling` to finish. If the cluster gets deleted (404), returns nil so a new cluster is created.
Evidence: 74 failures across 2 builds from "cluster ... is in state Deleting, and won't retry"