ACM-35158 avoid blocking incoming requests [release-2.16] #6339
KevinFCormier wants to merge 2 commits into
…oming network requests Assisted-by: Cursor (Claude Opus 4.6 High) Signed-off-by: Kevin Cormier <kcormier@redhat.com>
/cc @fxiang1
Generated-by: Cursor (Claude Opus 4.6 High) Signed-off-by: Kevin Cormier <kcormier@redhat.com>
b975c63 to d1b5b32
/test unit-tests-sonarcloud
/lgtm I put this on hold, not sure if the 2.16 stream is open.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fxiang1, KevinFCormier



📝 Summary
Ticket Summary (Title):
console-mce pod CrashLoopBackOff due to probe failure with large resource sets
Ticket Link:
https://redhat.atlassian.net/browse/ACM-35499
🗒️ Notes for Reviewers
The customer was observing CrashLoopBackOff on console-mce pods and had over 20,000 Group resources present. Initial processing of these resources may have blocked us from responding to the readiness and liveness probes on time, causing Kubernetes to restart the pod.
This PR batches the Promise.all calls: after each batch of 100, we call setImmediate to yield to the event loop so that new requests can be handled. This also seems to smooth out the pods' memory usage. Before this change, I observed a spike at startup, followed by a retreat to a stable size.
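The batching idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `processInBatches`, its signature, and the `worker` callback are hypothetical, and only the batch size of 100 and the use of `Promise.all` and `setImmediate` come from the description above.

```typescript
// Hypothetical helper for illustration; the name and signature are assumptions.
async function processInBatches<T, R>(
  items: readonly T[],
  worker: (item: T) => Promise<R>,
  batchSize = 100
): Promise<R[]> {
  const results: R[] = []
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize)
    // Run one batch concurrently, as a single Promise.all over all items would.
    results.push(...(await Promise.all(batch.map(worker))))
    // Yield to the event loop so queued callbacks (for example, handlers for
    // incoming probe requests) get a chance to run before the next batch.
    await new Promise<void>((resolve) => setImmediate(resolve))
  }
  return results
}
```

Results keep the input order, so a call such as `await processInBatches(resources, handleResource)` could stand in for `await Promise.all(resources.map(handleResource))`, where `resources` and `handleResource` are placeholder names.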
Memory Usage
In the following videos, I deleted pods, then observed the memory usage of the replacements. Before the patch, the pods sometimes have a memory spike during startup; in this case, one of the pods spikes to almost 12 GB before falling back to a more typical value. After the patch, there is no initial spike.
Before
Before.Patch.mov
After
After.Patch.mov
Liveness Probe Response Time
I also checked the response time, turning on garbage collection tracing and using a simple script to repeatedly check the liveness probe endpoint. I tested against cluster `kevin-probe-test`, which has 50,000 Group resources on it.

Here is the process I followed:
- Run `./check-response-time.sh` or `./check-response-time.sh -t 1.0` in one terminal. (The latter command shows only response times greater than 1.0 seconds.)
- Run `npm run plugins`.
- Open `https://localhost:9000/multicloud/credentials` in several tabs. Repeatedly refresh the tabs to drive new full loads of the SSE stream.

Using a baseline test branch vs. a patched test branch, I found:
- Ran `./check-response-time.sh -t 0.1` and only saw response times up to just under 0.5 s.
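For readers without access to `check-response-time.sh`, the measurement can be approximated with a sketch like the one below. It is a hypothetical TypeScript analogue, not the script used here: the function names, the sampling parameters, and the way the request is supplied are all assumptions. The threshold plays the role of the `-t` option described above.

```typescript
// Hypothetical analogue of a response-time checker; not the author's script.
type Sample = { seconds: number; ok: boolean }

// Time a single request. Failures are recorded rather than thrown so that
// one failed probe does not stop the sampling loop.
async function timeRequest(request: () => Promise<unknown>): Promise<Sample> {
  const start = performance.now()
  let ok = true
  try {
    await request()
  } catch {
    ok = false
  }
  return { seconds: (performance.now() - start) / 1000, ok }
}

// Issue `count` requests in sequence and keep only the samples that took
// longer than `thresholdSeconds`.
async function collectSlowSamples(
  request: () => Promise<unknown>,
  thresholdSeconds = 0,
  count = 10,
  pauseMs = 100
): Promise<Sample[]> {
  const slow: Sample[] = []
  for (let i = 0; i < count; i++) {
    const sample = await timeRequest(request)
    if (sample.seconds > thresholdSeconds) slow.push(sample)
    // Short pause between requests so the checker itself does not flood the server.
    await new Promise<void>((resolve) => setTimeout(resolve, pauseMs))
  }
  return slow
}
```

With a recent Node.js, something like `collectSlowSamples(() => fetch(probeUrl), 1.0)` would report only responses slower than one second, where `probeUrl` is a placeholder for the liveness probe endpoint, which is not given here.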