
fix(webhook): fix cache error on k8s client pagination #1059

Open
Zenithar wants to merge 3 commits into main from
zenithar/chaos-controller/fix_safetyNetCountTooLarge_error

Conversation

Contributor

Zenithar commented Apr 9, 2026

What does this PR do?

  • Adds new functionality
  • Alters existing functionality
  • Fixes a bug
  • Improves documentation or testing

Summary

Fixes the webhook admission error:

"chaos-controller.chaos-engineering.svc" denied the request: error checking for
countNotTooLarge safetynet: error listing target pods: continue list option is
not supported by the cache

Root Cause

safetyNetCountNotTooLarge uses paginated listing (Limit/Continue) with the
controller-runtime cached client (Manager.GetClient()). The cache has two
known issues (kubernetes-sigs/controller-runtime#3044):

  1. Continue is not supported — the cache rejects it with a hard error.
  2. When Limit is set, the cache silently returns an incomplete list
    truncated at the limit, without signaling that the list is incomplete.

This means neither paginated nor unpaginated listing works correctly with the
cached client for counting resources.

Fix

Use mgr.GetAPIReader() — a direct API server client that supports proper
pagination — for the List calls in safetyNetCountNotTooLarge.

Changes:

  • utils/utils.go: Add APIReader client.Reader field to
    SetupWebhookWithManagerConfig.
  • main.go: Pass mgr.GetAPIReader() in the webhook config
    (GetAPIReader() is already used for watchers).
  • api/v1beta1/disruption_webhook.go:
    • Add apiReader package-level variable, initialized from config.
    • Restore paginated List calls (Limit: 1000 + Continue loops)
      using apiReader instead of the cached k8sClient.
    • Fix incorrect error message "error listing target pods" → "error listing
      nodes" in the node-level branch.
    • Pass ctx from the admission handler instead of context.Background().
    • Add nil guards to initialSafetyNets and safetyNetMissingNetworkFilters.
    • Change safetyNetMissingNetworkFilters from value to pointer receiver.
  • api/v1beta1/disruption_webhook_test.go: Set apiReader in all test
    setups that exercise safety nets. Add 6 new unit tests for
    safetyNetCountNotTooLarge.

Code Quality Checklist

  • The documentation is up to date.
  • My code is sufficiently commented and passes continuous integration checks.
  • I have signed my commit (see Contributing Docs).

Testing

  • I leveraged continuous integration testing
    • by depending on existing unit tests or end-to-end tests.
    • by adding new unit tests or end-to-end tests.
  • I manually tested the following steps:
    • Applied a Disruption resource that previously triggered the error,
      both locally and as a canary deployment to a cluster.

New tests added

6 unit tests for safetyNetCountNotTooLarge covering:

  • Small fraction of pods (no trigger)
  • >80% of namespace pods (triggers)
  • >66% of cluster pods (triggers)
  • disableCountTooLarge bypass
  • Node-level disruption listing
  • No matching targets (early return)

Zenithar self-assigned this Apr 9, 2026
Zenithar marked this pull request as ready for review April 9, 2026 15:41
Zenithar requested a review from a team as a code owner April 9, 2026 15:41
@datadog-official

datadog-official bot commented Apr 9, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 38.71%
Overall Coverage: 38.71% (+0.22%)

🔗 Commit SHA: 52fa84e

@@ -71,6 +72,7 @@ func (d *Disruption) SetupWebhookWithManager(setupWebhookConfig utils.SetupWebho
tagutil.AdmissionControllerKey, "disruption-webhook",
)
k8sClient = setupWebhookConfig.Manager.GetClient()
Contributor

is there any other place we are using the paginated list in the code?

Contributor Author

No, this is the only place in the codebase using paginated listing (Limit/Continue).

The other List calls don't paginate:

Location                                    What it lists                   Paginated?
targetselector/running_target_selector.go   Pods/nodes by label selector    No
controllers/disruption_controller.go        All Disruption CRDs             No
api/v1beta1/disruption_pods.go              Chaos pods by label             No
watchers/target_pod_handler.go              Events by field selector        No
controllers/cron_rollout_helpers.go         Disruptions by label            No
watchers/*.go                               DisruptionRollouts by field     No

The safety net function is the only place that needs to count all pods/nodes cluster-wide (or namespace-wide without label filtering), which is why it's the only one that can hit a result set large enough to require pagination. The other List calls are scoped by label selectors, namespaces, or specific resource types (Disruption CRDs) that return small result sets.

@aymericDD
Contributor

aymericDD commented Apr 13, 2026

Could you deploy this PR and test it and see if you noticed a performance improvement?

@Zenithar
Contributor Author

Zenithar commented Apr 13, 2026

Could you deploy this PR and test it and see if you noticed a performance improvement?

It's not a performance improvement; it's a correctness fix. The cached Kubernetes client doesn't support pagination (per kubernetes-sigs/controller-runtime#3044), so whenever the validation needs a pagination continue token, the List call fails with a hard error and prevents any disruption from being started.

