fix: retry connection when exporter is temporarily unavailable#829
fix: retry connection when exporter is temporarily unavailable#829evakhoni wants to merge 2 commits into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
0a61201 to
7142c52
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (1)
158-188: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick winTiming window is too tight and can make this test flaky.
Line 164 sets
future_endto only 50ms ahead, so scheduler jitter can make the lease appear expired on first poll and intermittently fail thecall_count >= 2assertion at Line 185.Suggested stabilization
- future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50) + future_end = datetime.now(tz=timezone.utc) + timedelta(seconds=1) @@ - async with lease.monitor_async(): - await asyncio.sleep(0.2) + async with lease.monitor_async(): + await asyncio.sleep(1.2)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py` around lines 158 - 188, The timing window in the test_uses_cached_end_time_when_poll_fails method is too tight, with future_end set only 50 milliseconds ahead which can cause scheduler jitter to make the lease appear expired prematurely. Increase the timedelta(milliseconds=50) value to a much larger duration (such as several seconds) to provide a stable timing window that ensures the lease does not expire during the asyncio.sleep(0.2) call and allows the call_count >= 2 assertion to reliably pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 298-303: The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 399-405: The try-except block in the connect_router_stream context
manager is catching all exceptions and wrapping them into
ExporterUnavailableError, which masks unrelated failures and cancellation
errors. Replace the broad Exception catch with specific exception types that
represent connection and transport failures, and re-raise any other exceptions
that are not related to router connectivity issues. This allows
lifecycle/cancellation errors and other root causes to propagate correctly
without being incorrectly classified as retryable exporter unavailability.
---
Outside diff comments:
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 158-188: The timing window in the
test_uses_cached_end_time_when_poll_fails method is too tight, with future_end
set only 50 milliseconds ahead which can cause scheduler jitter to make the
lease appear expired prematurely. Increase the timedelta(milliseconds=50) value
to a much larger duration (such as several seconds) to provide a stable timing
window that ensures the lease does not expire during the asyncio.sleep(0.2) call
and allows the call_count >= 2 assertion to reliably pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0be35727-d399-4af5-98ca-7fe5a02c5b38
📒 Files selected for processing (8)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.pypython/packages/jumpstarter-cli/jumpstarter_cli/shell_test.pypython/packages/jumpstarter/jumpstarter/client/client.pypython/packages/jumpstarter/jumpstarter/client/client_test.pypython/packages/jumpstarter/jumpstarter/client/lease.pypython/packages/jumpstarter/jumpstarter/client/lease_test.pypython/packages/jumpstarter/jumpstarter/common/exceptions.pypython/packages/jumpstarter/jumpstarter/config/client.py
| try: | ||
| await client.get_status_async() | ||
| except Exception as e: | ||
| raise ExporterUnavailableError( | ||
| f"Exporter {lease.exporter_name} did not respond to initial status check" | ||
| ) from e |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP --type=py -C5 'def get_status_async'
rg -nP --type=py -C3 'get_status_async\(' python/packages/jumpstarter-cliRepository: jumpstarter-dev/jumpstarter
Length of output: 165
🏁 Script executed:
# First, let's map the file structure and read the target lines
ast-grep outline python/packages/jumpstarter-cli/jumpstarter_cli/shell.py --view expanded | head -50Repository: jumpstarter-dev/jumpstarter
Length of output: 1387
🏁 Script executed:
# Read the specific lines mentioned in the review (298-303)
sed -n '290,310p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 1502
🏁 Script executed:
# Search for get_status_async more broadly in the codebase
rg 'get_status_async' --type=py python/packages/jumpstarter-cli/ -nRepository: jumpstarter-dev/jumpstarter
Length of output: 673
🏁 Script executed:
# Also search in proto files which might have gRPC method definitions
fd -e proto python/packages/jumpstarter-cli | head -20Repository: jumpstarter-dev/jumpstarter
Length of output: 595
🏁 Script executed:
# Let's look at the context around lines 298-303 and the retry wrapper at 436-498
sed -n '280,320p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 2548
🏁 Script executed:
# Look at the retry wrapper implementation mentioned in the review
sed -n '436,500p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 2967
🏁 Script executed:
# Find where the client object comes from and what type it is
sed -n '267,300p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 1951
🏁 Script executed:
# Search for ExporterUnavailableError definition and other uses
rg 'ExporterUnavailableError' python/packages/jumpstarter-cli/ -nRepository: jumpstarter-dev/jumpstarter
Length of output: 1544
🏁 Script executed:
# Look for proto files or gRPC client definitions
fd . python/packages/jumpstarter-cli -name "*.proto" -o -name "*pb2*" -o -name "*_pb2*"Repository: jumpstarter-dev/jumpstarter
Length of output: 396
🏁 Script executed:
# Find where client_from_path is defined to understand the client type
rg 'def client_from_path' python/packages/jumpstarter-cli/ -n -A 5Repository: jumpstarter-dev/jumpstarter
Length of output: 165
🏁 Script executed:
# Search for get_status_async definition across the entire codebase
rg 'get_status_async' python/ -n -A 3 | head -100Repository: jumpstarter-dev/jumpstarter
Length of output: 7345
🏁 Script executed:
# Look for gRPC stub definitions or client implementations
fd --full-path 'python/packages/jumpstarter-cli' -name '*.py' -type f | xargs grep -l 'get_status' | head -10Repository: jumpstarter-dev/jumpstarter
Length of output: 396
🏁 Script executed:
# Check imports in shell.py to understand where exceptions come from
sed -n '1,50p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 1789
🏁 Script executed:
# Let's read the actual implementation of get_status_async() to see what exceptions it can raise
sed -n '103,150p' python/packages/jumpstarter/jumpstarter/client/core.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 2156
🏁 Script executed:
# Check the error handling in get_status_async more closely
sed -n '103,180p' python/packages/jumpstarter/jumpstarter/client/core.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 3470
🏁 Script executed:
# Look for RPC error handling patterns in the client
rg 'RpcError|StatusCode' python/packages/jumpstarter/jumpstarter/client/core.py -n -A 2 -B 2Repository: jumpstarter-dev/jumpstarter
Length of output: 4965
🏁 Script executed:
# Check what exception types are used in the common exceptions module
rg 'class ExporterUnavailableError|class ConnectionError' python/packages/jumpstarter/jumpstarter/common/ -n -A 5Repository: jumpstarter-dev/jumpstarter
Length of output: 1268
🏁 Script executed:
# Check if there are any other exceptions that could be raised from get_status_async or its dependencies
rg 'raise.*Error' python/packages/jumpstarter/jumpstarter/client/core.py -n -B 2 | head -50Repository: jumpstarter-dev/jumpstarter
Length of output: 1396
🏁 Script executed:
# Look for all exception types that could be raised in the status monitoring context
rg 'class.*Error' python/packages/jumpstarter/jumpstarter/common/exceptions.py -nRepository: jumpstarter-dev/jumpstarter
Length of output: 651
🏁 Script executed:
# Check the tests to understand what exceptions are expected from get_status_async
sed -n '130,165p' python/packages/jumpstarter/jumpstarter/client/core_test.pyRepository: jumpstarter-dev/jumpstarter
Length of output: 1506
Broad except Exception converts both transient and non-transient RPC errors to retryable failures.
The specific concern about unsupported exporters is unfounded—get_status_async() returns None for UNIMPLEMENTED RPC errors, not an exception. However, a real issue exists with other RPC errors: when get_status_async() encounters non-transient gRPC failures (e.g., UNKNOWN, INVALID_ARGUMENT), it raises DriverError. Converting all exceptions to ExporterUnavailableError causes these non-transient errors to be retried for the full connect_retry_timeout (default 120s) before failing with "did not become available", masking the real error.
Narrow the catch block to only transient connection-level failures (e.g., specific RPC error codes or connection/unavailability errors) and let other exceptions propagate unchanged.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py` around lines 298 -
303, The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.
| try: | ||
| async with connect_router_stream( | ||
| response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options | ||
| ): | ||
| pass | ||
| except Exception as e: | ||
| raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Do not classify every router-stream exception as retryable exporter unavailability.
This catch-all wraps unrelated failures (and lifecycle/cancellation errors) into ExporterUnavailableError, which can trigger incorrect outer retries and hide root causes. Restrict wrapping to connection/transport failures and re-raise everything else.
Suggested fix
- except Exception as e:
- raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
+ except (AioRpcError, OSError) as e:
+ raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/packages/jumpstarter/jumpstarter/client/lease.py` around lines 399 -
405, The try-except block in the connect_router_stream context manager is
catching all exceptions and wrapping them into ExporterUnavailableError, which
masks unrelated failures and cancellation errors. Replace the broad Exception
catch with specific exception types that represent connection and transport
failures, and re-raise any other exceptions that are not related to router
connectivity issues. This allows lifecycle/cancellation errors and other root
causes to propagate correctly without being incorrectly classified as retryable
exporter unavailability.
|
wait. this have a regression with broken retry mid-session. |
5e5e92e to
19e3d71
Compare
When `jmp shell` connects to an exporter that was killed or suspended, GetReport fails with UNAVAILABLE. Convert this to ExporterUnreachableError in client_from_path so the retry loop in _shell_with_signal_handling can release the lease and re-acquire, eventually reaching a live exporter. Add connect_timeout (default 5m) to bound the retry window for both auto-created and pre-created leases. Also fix 2**attempt overflow in the Dial retry loop by capping the exponent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- client_test: verify client_from_path converts UNAVAILABLE and DEADLINE_EXCEEDED to ExporterUnreachableError, other codes propagate - shell_test: verify retry loop times out with connect_timeout, and succeeds when exporter recovers before timeout - lease_test: fix MockAioRpcError attrs for pytest display, cap 2**attempt in retry tests to prevent overflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
19e3d71 to
f3051a8
Compare
When
jmp shellconnects to an exporter that was killed, suspended, ornetwork-partitioned, the initial
GetReportRPC (insideclient_from_channel)fails with
AioRpcError(UNAVAILABLE). Previously this surfaced as anunhandled gRPC error. Now the client releases the lease and retries the full
connection cycle.
Changes
gRPC error conversion (
client.py):client_from_pathwrapsclient_from_channelso thatAioRpcError(UNAVAILABLE | DEADLINE_EXCEEDED)from the initial
GetReportcall is converted toExporterUnreachableError.Other gRPC codes propagate unchanged.
Retry loop (
shell.py):_shell_with_signal_handlingwraps thelease-acquire-and-connect block in a
while Trueloop. OnExporterUnreachableError, the lease context manager exits (releasing thelease), a warning is logged, and the loop re-acquires. A
connect_timeout(default 5 minutes) bounds the total retry window. For auto-created leases
(
--selector), the controller may assign a different exporter on each cycle.For pre-created leases (
--lease), the same exporter is retried until itrecovers or the timeout expires.
connect_timeoutconfig (config/client.py,lease.py): New field onClientConfigV1Alpha1Lease(default 300s), passed through to theLeasedataclass.
Overflow fix (
lease.py): Cap2 ** attemptto2 ** min(attempt, 10)in the Dial retry backoff to prevent
OverflowErrorwhensleepis mockedin tests.
Behaviour
ExporterUnreachableErrorwith elapsed timeTesting
client_test.py:GetReporterror conversion for UNAVAILABLE, DEADLINE_EXCEEDED, and passthrough of other codesshell_test.py: retry loop timeout exhaustion, and successful retry when exporter recoverslease_test.py: fixMockAioRpcErrorinternal attrs for pytest display