
fix(webhook): fix cache error on k8s client pagination #1059

Open
Zenithar wants to merge 3 commits into main from
zenithar/chaos-controller/fix_safetyNetCountTooLarge_error

Conversation

Contributor

Zenithar commented Apr 9, 2026

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Summary

Fixes the webhook admission error:

"chaos-controller.chaos-engineering.svc" denied the request: error checking for
countNotTooLarge safetynet: error listing target pods: continue list option is
not supported by the cache

Root Cause

safetyNetCountNotTooLarge uses paginated listing (Limit/Continue) with the
controller-runtime cached client (Manager.GetClient()). The cache has two
known issues (kubernetes-sigs/controller-runtime#3044):

  1. Continue is not supported — the cache rejects it with a hard error.
  2. When Limit is set, the cache silently returns an incomplete list
    truncated at the limit, without signaling that the list is incomplete.

This means neither paginated nor unpaginated listing works correctly with the
cached client for counting resources.

Fix

Use mgr.GetAPIReader() — a direct API server client that supports proper
pagination — for the List calls in safetyNetCountNotTooLarge.

Changes:

  • utils/utils.go: Add APIReader client.Reader field to
    SetupWebhookWithManagerConfig.
  • main.go: Pass mgr.GetAPIReader() in the webhook config
    (GetAPIReader() is already used for watchers).
  • api/v1beta1/disruption_webhook.go:
    • Add apiReader package-level variable, initialized from config.
    • Restore paginated List calls (Limit: 1000 + Continue loops)
      using apiReader instead of the cached k8sClient.
    • Fix incorrect error message "error listing target pods" → "error listing
      nodes" in the node-level branch.
    • Pass ctx from the admission handler instead of context.Background().
    • Add nil guards to initialSafetyNets and safetyNetMissingNetworkFilters.
    • Change safetyNetMissingNetworkFilters from value to pointer receiver.
  • api/v1beta1/disruption_webhook_test.go: Set apiReader in all test
    setups that exercise safety nets. Add 6 new unit tests for
    safetyNetCountNotTooLarge.

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Applied a Disruption resource that previously triggered the error,
      both locally and as a canary deployment to a cluster.

New tests added

6 unit tests for safetyNetCountNotTooLarge covering:

  • Small fraction of pods (no trigger)
  • >80% of namespace pods (triggers)
  • >66% of cluster pods (triggers)
  • disableCountTooLarge bypass
  • Node-level disruption listing
  • No matching targets (early return)

Zenithar self-assigned this Apr 9, 2026
Zenithar marked this pull request as ready for review April 9, 2026 15:41
Zenithar requested a review from a team as a code owner April 9, 2026 15:41
@datadog-official

datadog-official bot commented Apr 9, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 38.71%
Overall Coverage: 38.71% (+0.22%)

🔗 Commit SHA: 52fa84e

@@ -71,6 +72,7 @@ func (d *Disruption) SetupWebhookWithManager(setupWebhookConfig utils.SetupWebho
tagutil.AdmissionControllerKey, "disruption-webhook",
)
k8sClient = setupWebhookConfig.Manager.GetClient()
Contributor

is there any other place we are using the paginated list in the code?

Contributor Author

No, this is the only place in the codebase using paginated listing (Limit/Continue).

The other List calls don't paginate:

Location                                    What it lists                   Paginated?
targetselector/running_target_selector.go   Pods/nodes by label selector    No
controllers/disruption_controller.go        All Disruption CRDs             No
api/v1beta1/disruption_pods.go              Chaos pods by label             No
watchers/target_pod_handler.go              Events by field selector        No
controllers/cron_rollout_helpers.go         Disruptions by label            No
watchers/*.go                               DisruptionRollouts by field     No

The safety net function is the only place that needs to count all pods/nodes cluster-wide (or namespace-wide without label filtering), which is why it's the only one that can hit a result set large enough to require pagination. The other List calls are scoped by label selectors, namespaces, or specific resource types (Disruption CRDs) that return small result sets.

@aymericDD
Contributor

aymericDD commented Apr 13, 2026

Could you deploy this PR and test it and see if you noticed a performance improvement?

@Zenithar
Contributor Author

Zenithar commented Apr 13, 2026

Could you deploy this PR and test it and see if you noticed a performance improvement?

It's not a performance improvement; it's a correctness fix. The cached Kubernetes client doesn't support pagination (per kubernetes-sigs/controller-runtime#3044), so whenever the validation needs a pagination continue token, the List call fails with a hard error and prevents any disruption from being started.

