fix(webhook): fix cache error on k8s client pagination.#1059
fix(webhook): fix cache error on k8s client pagination.#1059
Conversation
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage (details) 🔗 Commit SHA: 52fa84e | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback! |
| @@ -71,6 +72,7 @@ func (d *Disruption) SetupWebhookWithManager(setupWebhookConfig utils.SetupWebho | |||
| tagutil.AdmissionControllerKey, "disruption-webhook", | |||
| ) | |||
| k8sClient = setupWebhookConfig.Manager.GetClient() | |||
There was a problem hiding this comment.
is there any other place we are using the paginated list in the code?
There was a problem hiding this comment.
No, this is the only place in the codebase using paginated listing (Limit/Continue).
The other List calls don't paginate:
| Location | What it lists | Paginated? |
|---|---|---|
targetselector/running_target_selector.go |
Pods/nodes by label selector | No |
controllers/disruption_controller.go |
All Disruption CRDs | No |
api/v1beta1/disruption_pods.go |
Chaos pods by label | No |
watchers/target_pod_handler.go |
Events by field selector | No |
controllers/cron_rollout_helpers.go |
Disruptions by label | No |
watchers/*.go |
DisruptionRollouts by field | No |
The safety net function is the only place that needs to count all pods/nodes cluster-wide (or namespace-wide without label filtering), which is why it's the only one that can hit a result set large enough to require pagination. The other List calls are scoped by label selectors, namespaces, or specific resource types (Disruption CRDs) that return small result sets.
|
Could you deploy this PR and test it and see if you noticed a performance improvement? |
It's not a performance improvement; it's a correctness improvement. Currently, the cached K8S API doesn't support pagination (per kubernetes-sigs/controller-runtime#3044). So if the validation requires a pagination continue, it will crash and prevent any disruption from being started. |
…ndtrip data size.
What does this PR do?
Summary
Fixes the webhook admission error:
Root Cause
safetyNetCountNotTooLargeuses paginated listing (Limit/Continue) with thecontroller-runtime cached client (
Manager.GetClient()). The cache has twoknown issues (kubernetes-sigs/controller-runtime#3044):
Continueis not supported — the cache rejects it with a hard error.Limitis set, the cache silently returns an incomplete listtruncated at the limit, without signaling that the list is incomplete.
This means neither paginated nor unpaginated listing works correctly with the
cached client for counting resources.
Fix
Use
mgr.GetAPIReader()— a direct API server client that supports properpagination — for the List calls in
safetyNetCountNotTooLarge.Changes:
utils/utils.go: AddAPIReader client.Readerfield toSetupWebhookWithManagerConfig.main.go: Passmgr.GetAPIReader()in the webhook config(
GetAPIReader()is already used for watchers).api/v1beta1/disruption_webhook.go:apiReaderpackage-level variable, initialized from config.Listcalls (Limit: 1000+Continueloops)using
apiReaderinstead of the cachedk8sClient.nodes" in the node-level branch.
ctxfrom the admission handler instead ofcontext.Background().initialSafetyNetsandsafetyNetMissingNetworkFilters.safetyNetMissingNetworkFiltersfrom value to pointer receiver.api/v1beta1/disruption_webhook_test.go: SetapiReaderin all testsetups that exercise safety nets. Add 6 new unit tests for
safetyNetCountNotTooLarge.Code Quality Checklist
Testing
unittests orend-to-endtests.unittests orend-to-endtests.Disruptionresource that previously triggered the error.New tests added
6 unit tests for
safetyNetCountNotTooLargecovering:disableCountTooLargebypass