fix(e2e): improve infra setup reliability with retries and tolerant GC #8488
r2k1 wants to merge 1 commit into
Conversation
Pull request overview
This PR aims to improve E2E infrastructure setup reliability by adding retry/timeout patterns around Azure ARM operations and making cluster garbage-collection more tolerant so transient failures don’t cascade across all tests sharing a cluster.
Changes:
- Add retry with per-call timeouts for Bastion subnet GET and for AKS subnet + route table readiness checks.
- Make stale Kubernetes node GC tolerant to `NotFound` and non-fatal per-node delete errors.
- Adjust VMSS GC bookkeeping so VMSS slated for deletion don't shield their nodes from cleanup.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| e2e/cluster.go | Adds ARM GET retries/timeouts for Bastion subnet, changes VMSS “keep” set semantics, and makes stale node GC more tolerant. |
| e2e/aks_model.go | Adds retries/timeouts around AKS subnet/route table discovery and retries Azure Firewall creation. |
```go
		return false, nil
	})
	if err != nil {
		return nil, fmt.Errorf("get subnet %q in vnet %q rg %q: %w", bastionSubnetName, vnet.name, nodeRG, subnetGetErr)
```

```go
			continue
		}
		toolkit.Logf(ctx, "deleted vmss %q (age: %v)", *vmss.ID, time.Since(*vmss.Properties.TimeCreated))
		// Don't add to existingVMSS — nodes from this VMSS should be cleaned up
```
```go
	// Retry the AKS subnet GET and route table lookup to tolerate:
	// - transient ARM timeouts on the subnet GET
	// - eventual consistency: kubenet route table may still be propagating after cluster create/reuse
	var aksRTName string
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
		aksSubnetResp, subnetErr := config.Azure.Subnet.Get(callCtx, rg, vnet.name, "aks-subnet", nil)
		if subnetErr != nil {
			toolkit.Logf(ctx, "transient error getting AKS subnet (retrying): %v", subnetErr)
			return false, nil
		}
		rtName, rtErr := ensureFirewallRouteTable(ctx, clusterModel, vnet.name, aksSubnetResp.Subnet)
		if rtErr != nil {
			toolkit.Logf(ctx, "route table not ready (retrying): %v", rtErr)
			return false, nil
		}
		aksRTName = rtName
		return true, nil
	})
```
```go
	var fwResp armnetwork.AzureFirewallsClientCreateOrUpdateResponse
	err = wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true, func(ctx context.Context) (bool, error) {
		fwPoller, fwErr := config.Azure.AzureFirewall.BeginCreateOrUpdate(ctx, rg, firewallName, *firewall, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "transient error starting firewall creation (retrying): %v", fwErr)
			return false, nil
		}
		fwResp, fwErr = fwPoller.PollUntilDone(ctx, nil)
		if fwErr != nil {
			toolkit.Logf(ctx, "firewall creation did not complete (retrying): %v", fwErr)
			return false, nil
		}
		return true, nil
	})
```
timmy-wright
left a comment
LGTM - it'd be good to know which Copilot comments you've responded to. Currently they are all unresolved, but it looks like you took some of them into account.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
e2e/aks_model.go:455
- The firewall creation retry loop retries on any error from `BeginCreateOrUpdate` or `PollUntilDone`, including potentially permanent configuration/authorization errors. This can mask the real failure and waste up to 10 minutes before surfacing it. Consider classifying errors (e.g., `azcore.ResponseError` with a 4xx status other than 408/429) as non-retryable and returning immediately, retrying only on timeouts/429/5xx.
```go
	}
	if firewallPrivateIP == "" {
		return fmt.Errorf("failed to get firewall private IP address")
	}
	// Add firewall routes to the existing AKS route table using individual
	// route operations. This avoids replacing the entire table (which would
	// race with cloud-provider-azure pod route updates) and preserves the
	// subnet association so pod CIDR routes remain active.
	firewallRoutes := []armnetwork.Route{
		{
			Name: to.Ptr("vnet-local"),
			Properties: &armnetwork.RoutePropertiesFormat{
				AddressPrefix: to.Ptr("10.224.0.0/16"),
				NextHopType:   to.Ptr(armnetwork.RouteNextHopTypeVnetLocal),
```
```diff
 		if err := kube.Typed.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
-			deleteErrors = append(deleteErrors, fmt.Errorf("deleting stale node %q: %w", node.Name, err))
+			if apierrors.IsNotFound(err) {
+				toolkit.Logf(ctx, "stale K8s node %q already gone", node.Name)
+				continue
+			}
+			toolkit.Logf(ctx, "warning: failed to delete stale K8s node %q: %v", node.Name, err)
+			failed++
 			continue
 		}
 		toolkit.Logf(ctx, "deleted stale K8s node %q (VMSS %q not found)", node.Name, vmssName)
 		deleted++
 	}

-	if len(deleteErrors) > 0 {
-		return fmt.Errorf("failed to delete %d stale nodes, first error: %w", len(deleteErrors), deleteErrors[0])
+	if failed > 0 && deleted == 0 {
+		return fmt.Errorf("failed to delete any of %d stale nodes", failed)
 	}
```
It's ok, we only analyse logs.
I use the ADO UI to debug; why would we want to force users to fetch logs from artifacts?
```go
	if errors.As(subnetGetErr, &subnetAzErr) && subnetAzErr.StatusCode == http.StatusNotFound {
		return true, nil // 404 is expected — will create below
```
I don't want to deal with it. If we can't configure this, there is no point in saving budget. We will fail.
Address infrastructure failures that cause cascading test failures
(464 failures across 46 builds over 3 weeks).
1. Make stale node GC tolerant (cluster.go):
- Ignore NotFound errors on node deletion (already gone)
- Log other delete errors as warnings instead of failing
- Only return error if zero nodes could be deleted
Evidence: 79 failures from 'failed to delete 3 stale nodes'
2. Fix existingVMSS map (cluster.go):
- Only add VMSS to existingVMSS if they are being kept
- VMSS queued for deletion are excluded, so their stale K8s
nodes can be cleaned up in the same pass
- If VMSS deletion fails, keep in map to avoid orphaned deletes
3. Retry Bastion subnet GET (cluster.go):
- Poll with backoff for up to 30s on transient ARM errors
- 404 still handled normally (create subnet)
Evidence: 179 failures from 'get subnet AzureBastionSubnet:
context deadline exceeded'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Infrastructure setup failures cause cascading test failures — one transient Azure error fails ALL tests sharing that cluster. This PR adds retries, tolerant cleanup, and handles cluster lifecycle states properly.
Evidence: 464 failures across 46 builds over 3 weeks.
Changes
1. Tolerant stale node GC (`cluster.go`)
Previously any single failed K8s node delete killed cluster prep for all tests. The actual error was `nodes "..." not found` — the node was already gone.
Now: `NotFound` ignored, other errors logged as warnings; only fails if zero progress.
Evidence: 79 failures from "failed to delete 3 stale nodes, first error: ... not found"
2. Fix existingVMSS map (`cluster.go`)
VMSS queued for deletion were kept in the "existing" map, protecting their orphaned K8s nodes from cleanup. Now only kept VMSS go in the map.
3. Retry Bastion subnet GET with per-call timeout (`cluster.go`)
A single unresponsive `Subnet.Get` consumed the entire 20-minute cluster prep budget. Logs confirmed: "preparing cluster done (1200.0s)".
Now: each GET has a 30-second timeout, retried every 5s for up to 2 minutes.
Evidence: 179 failures across 3 builds from "get subnet AzureBastionSubnet: context deadline exceeded"
4. Handle Deleting/Canceled cluster states (`cluster.go`)
`waitUntilClusterReady` treated `Deleting` as an unknown state and failed immediately. A cluster stuck in `Deleting` from a previous run killed all tests.
Now: waits for `Deleting`/`Canceled`/`Canceling` to finish. If the cluster gets deleted (404), returns nil so a new cluster is created.
Evidence: 74 failures across 2 builds from "cluster ... is in state Deleting, and won't retry"