fix: handle None return from get_container_runtime() #159
Rahuldrabit wants to merge 2 commits into microsoft:main
Conversation
The `get_container_runtime()` method was returning `None` when called immediately after cluster creation, before any nodes reached Ready state. This caused a `TypeError: argument of type 'NoneType' is not iterable` when the code tried to check if `'docker'` was in the runtime string.

Changes:
- Add retry logic with a 60s timeout to `get_container_runtime()` to wait for at least one node to become Ready
- Add an explicit `None` check in `SymptomFaultInjector.__init__()` with a clear error message if the container runtime cannot be detected

Fixes `network_delay`, `pod_failure`, and other chaos mesh injection tasks that were crashing on initialization.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR prevents chaos-mesh fault injection tasks from crashing during early cluster startup by making container runtime detection resilient to clusters with no Ready nodes yet.
Changes:
- Add polling/retry behavior to `KubeCtl.get_container_runtime()` to wait for a Ready node before reading its runtime.
- Add an explicit `None` check in `SymptomFaultInjector.__init__()` to raise a clearer error instead of triggering a `TypeError`.
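The second bullet, the explicit `None` check, might look something like this sketch (the constructor signature and the `is_docker` attribute are assumptions for illustration, not the PR's exact code):

```python
class SymptomFaultInjector:
    def __init__(self, kubectl):
        # get_container_runtime() may return None if no node is Ready yet;
        # fail fast with a clear message instead of a later TypeError when
        # the code checks `if 'docker' in container_runtime`.
        container_runtime = kubectl.get_container_runtime()
        if container_runtime is None:
            raise RuntimeError(
                "Could not detect container runtime: no Ready node found. "
                "Ensure the cluster has at least one Ready node."
            )
        # e.g. 'docker://20.10.17' vs. 'containerd://1.6.8'
        self.is_docker = "docker" in container_runtime
```

Raising early in `__init__` turns a confusing `TypeError` deep in an injection task into an actionable error at construction time.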
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| aiopslab/service/kubectl.py | Adds bounded retry logic for detecting container runtime from a Ready node. |
| aiopslab/generators/fault/inject_symp.py | Validates runtime detection result and raises a clear error when unavailable. |
```python
elapsed = 0
while elapsed < max_wait:
    for node in self.core_v1_api.list_node().items:
        for status in node.status.conditions:
            if status.type == "Ready" and status.status == "True":
                return node.status.node_info.container_runtime_version
    time.sleep(poll_interval)
    elapsed += poll_interval
```
The retry loop can exceed `max_wait` when `poll_interval` is larger than the remaining time (e.g., `max_wait=1`, `poll_interval=10` will still sleep 10s). Consider using a monotonic deadline and sleeping `min(poll_interval, remaining)` (or validating/clamping `poll_interval`) so `max_wait` is an actual upper bound.
Suggested change:

```diff
-elapsed = 0
-while elapsed < max_wait:
-    for node in self.core_v1_api.list_node().items:
-        for status in node.status.conditions:
-            if status.type == "Ready" and status.status == "True":
-                return node.status.node_info.container_runtime_version
-    time.sleep(poll_interval)
-    elapsed += poll_interval
+deadline = time.monotonic() + max_wait
+while time.monotonic() < deadline:
+    for node in self.core_v1_api.list_node().items:
+        for status in node.status.conditions:
+            if status.type == "Ready" and status.status == "True":
+                return node.status.node_info.container_runtime_version
+    remaining = deadline - time.monotonic()
+    if remaining <= 0:
+        break
+    time.sleep(min(poll_interval, remaining))
```
Might be an extreme case, but it's good practice.
```python
while elapsed < max_wait:
    for node in self.core_v1_api.list_node().items:
        for status in node.status.conditions:
            if status.type == "Ready" and status.status == "True":
                return node.status.node_info.container_runtime_version
    time.sleep(poll_interval)
    elapsed += poll_interval
```
The docstring says this returns `None` when no Ready node is found, but `self.core_v1_api.list_node()` can raise (e.g., transient API errors / auth / connectivity) and will currently break out of the retry loop immediately. Either update the docstring to document the possible exception(s), or catch/log exceptions during polling and continue until the deadline (similar to `wait_for_ready`).
Good point. We should catch exceptions and update the docstrings.
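Combining both review points, the polling loop could look like the following standalone sketch. For self-containment, `list_nodes` stands in for `self.core_v1_api.list_node().items` and returns plain dicts; the function name and logging details are assumptions, not the PR's exact code:

```python
import logging
import time

logger = logging.getLogger(__name__)

def detect_container_runtime(list_nodes, max_wait=60, poll_interval=2):
    """Return the runtime version of the first Ready node, or None on timeout.

    Transient errors from list_nodes are logged and retried until the
    deadline, so max_wait is a hard upper bound on total waiting time.
    """
    deadline = time.monotonic() + max_wait
    while True:
        try:
            for node in list_nodes():
                for cond in node["conditions"]:
                    if cond["type"] == "Ready" and cond["status"] == "True":
                        return node["runtime"]
        except Exception as exc:  # transient API error: keep retrying
            logger.warning("Node listing failed, retrying: %s", exc)
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # no Ready node within max_wait
        # Sleep only the remaining time so we never overshoot the deadline.
        time.sleep(min(poll_interval, remaining))
```

Note how a raised exception and an empty node list fall through to the same deadline check, so neither can cause the loop to run past `max_wait`.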
*Screenshots: before fix / after fix*
Address Copilot AI review feedback:

1. Fix timing precision: Use a `time.monotonic()` deadline and sleep only the remaining time to ensure `max_wait` is an actual upper bound. The previous implementation could exceed `max_wait` by up to `poll_interval`.
2. Add exception handling: Catch transient API errors during polling and continue retrying until the deadline, similar to `wait_for_ready()`. Log warnings for persistent errors.
3. Update the docstring to document the exception handling behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Timing precision fixed: Switched to a `time.monotonic()` deadline. Exception handling added: Now catching transient API errors (connectivity, auth, etc.) and continuing to retry until the deadline, similar to the existing `wait_for_ready()`. Updated the docstring to document the exception handling behavior. Changes pushed in commit fbfa9f5.
Hi @Rahuldrabit, thanks for this PR! It looks great (I commented on the Copilot review without realizing you had already addressed them).
@microsoft-github-policy-service agree
Thank you @gaganso for your review. I have already agreed to the Contributor License Agreement. I don't know why the aiopslab-applications files are part of the PR — can you please tell me why, and which ones are the aiopslab application files?
Description
Fixes a TypeError crash that occurred when running network_delay and other chaos mesh injection tasks immediately after cluster creation.
Problem
The `get_container_runtime()` method was returning `None` when called before any Kubernetes nodes reached "Ready" state. This caused a `TypeError: argument of type 'NoneType' is not iterable` at line 30 in `inject_symp.py` when checking `if 'docker' in container_runtime`.

Solution

- `get_container_runtime()`: Waits up to 60 seconds (configurable) for at least one node to become Ready
- `SymptomFaultInjector.__init__()`: Provides a clear error message if the container runtime cannot be detected

Testing
- `network_delay_hotel_res-detection-1` task

Affected Tasks
All chaos mesh-based fault injection tasks including:
Checklist