Skip to content

fix: retry connection when exporter is temporarily unavailable#829

Draft
evakhoni wants to merge 2 commits into
jumpstarter-dev:mainfrom
evakhoni:fix/handle-dead-exporter
Draft

fix: retry connection when exporter is temporarily unavailable#829
evakhoni wants to merge 2 commits into
jumpstarter-dev:mainfrom
evakhoni:fix/handle-dead-exporter

Conversation

@evakhoni

@evakhoni evakhoni commented Jun 23, 2026

Copy link
Copy Markdown
Member

When jmp shell connects to an exporter that was killed, suspended, or
network-partitioned, the initial GetReport RPC (inside client_from_channel)
fails with AioRpcError(UNAVAILABLE). Previously this surfaced as an
unhandled gRPC error. Now the client releases the lease and retries the full
connection cycle.

Changes

gRPC error conversion (client.py): client_from_path wraps
client_from_channel so that AioRpcError(UNAVAILABLE | DEADLINE_EXCEEDED)
from the initial GetReport call is converted to ExporterUnreachableError.
Other gRPC codes propagate unchanged.

Retry loop (shell.py): _shell_with_signal_handling wraps the
lease-acquire-and-connect block in a while True loop. On
ExporterUnreachableError, the lease context manager exits (releasing the
lease), a warning is logged, and the loop re-acquires. A connect_timeout
(default 5 minutes) bounds the total retry window. For auto-created leases
(--selector), the controller may assign a different exporter on each cycle.
For pre-created leases (--lease), the same exporter is retried until it
recovers or the timeout expires.

connect_timeout config (config/client.py, lease.py): New field on
ClientConfigV1Alpha1Lease (default 300s), passed through to the Lease
dataclass.

Overflow fix (lease.py): Cap 2 ** attempt to 2 ** min(attempt, 10)
in the Dial retry backoff to prevent OverflowError when sleep is mocked
in tests.

Behaviour

Scenario Outcome
Exporter killed, controller not yet aware Retries with lease release/re-acquire, up to 5m
Exporter recovers during retry window Succeeds on next attempt
Pre-created lease, exporter recovers Same exporter retried, succeeds when it comes back
Retry budget exhausted ExporterUnreachableError with elapsed time

Testing

  • client_test.py: GetReport error conversion for UNAVAILABLE, DEADLINE_EXCEEDED, and passthrough of other codes
  • shell_test.py: retry loop timeout exhaustion, and successful retry when exporter recovers
  • lease_test.py: fix MockAioRpcError internal attrs for pytest display

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 222d12e2-42e7-48c3-b2c8-fc94a97babf8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: retry connection when exporter is temporarily unavailable' directly and concisely summarizes the main change: adding automatic retry logic for temporarily unreachable exporters.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description matches the changeset: it explains exporter connection error conversion and shell retry behavior.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@evakhoni evakhoni requested a review from mangelajo June 23, 2026 16:37
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch from 0a61201 to 7142c52 Compare June 23, 2026 16:40

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/client/lease_test.py (1)

158-188: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Timing window is too tight and can make this test flaky.

Line 164 sets future_end to only 50ms ahead, so scheduler jitter can make the lease appear expired on first poll and intermittently fail the call_count >= 2 assertion at Line 185.

Suggested stabilization
-        future_end = datetime.now(tz=timezone.utc) + timedelta(milliseconds=50)
+        future_end = datetime.now(tz=timezone.utc) + timedelta(seconds=1)
@@
-            async with lease.monitor_async():
-                await asyncio.sleep(0.2)
+            async with lease.monitor_async():
+                await asyncio.sleep(1.2)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py` around lines
158 - 188, The timing window in the test_uses_cached_end_time_when_poll_fails
method is too tight, with future_end set only 50 milliseconds ahead which can
cause scheduler jitter to make the lease appear expired prematurely. Increase
the timedelta(milliseconds=50) value to a much larger duration (such as several
seconds) to provide a stable timing window that ensures the lease does not
expire during the asyncio.sleep(0.2) call and allows the call_count >= 2
assertion to reliably pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 298-303: The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.

In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 399-405: The try-except block in the connect_router_stream context
manager is catching all exceptions and wrapping them into
ExporterUnavailableError, which masks unrelated failures and cancellation
errors. Replace the broad Exception catch with specific exception types that
represent connection and transport failures, and re-raise any other exceptions
that are not related to router connectivity issues. This allows
lifecycle/cancellation errors and other root causes to propagate correctly
without being incorrectly classified as retryable exporter unavailability.

---

Outside diff comments:
In `@python/packages/jumpstarter/jumpstarter/client/lease_test.py`:
- Around line 158-188: The timing window in the
test_uses_cached_end_time_when_poll_fails method is too tight, with future_end
set only 50 milliseconds ahead which can cause scheduler jitter to make the
lease appear expired prematurely. Increase the timedelta(milliseconds=50) value
to a much larger duration (such as several seconds) to provide a stable timing
window that ensures the lease does not expire during the asyncio.sleep(0.2) call
and allows the call_count >= 2 assertion to reliably pass.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0be35727-d399-4af5-98ca-7fe5a02c5b38

📥 Commits

Reviewing files that changed from the base of the PR and between 4a5d98e and 7142c52.

📒 Files selected for processing (8)
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell.py
  • python/packages/jumpstarter-cli/jumpstarter_cli/shell_test.py
  • python/packages/jumpstarter/jumpstarter/client/client.py
  • python/packages/jumpstarter/jumpstarter/client/client_test.py
  • python/packages/jumpstarter/jumpstarter/client/lease.py
  • python/packages/jumpstarter/jumpstarter/client/lease_test.py
  • python/packages/jumpstarter/jumpstarter/common/exceptions.py
  • python/packages/jumpstarter/jumpstarter/config/client.py

Comment on lines +298 to +303
try:
await client.get_status_async()
except Exception as e:
raise ExporterUnavailableError(
f"Exporter {lease.exporter_name} did not respond to initial status check"
) from e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP --type=py -C5 'def get_status_async' 
rg -nP --type=py -C3 'get_status_async\(' python/packages/jumpstarter-cli

Repository: jumpstarter-dev/jumpstarter

Length of output: 165


🏁 Script executed:

# First, let's map the file structure and read the target lines
ast-grep outline python/packages/jumpstarter-cli/jumpstarter_cli/shell.py --view expanded | head -50

Repository: jumpstarter-dev/jumpstarter

Length of output: 1387


🏁 Script executed:

# Read the specific lines mentioned in the review (298-303)
sed -n '290,310p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 1502


🏁 Script executed:

# Search for get_status_async more broadly in the codebase
rg 'get_status_async' --type=py python/packages/jumpstarter-cli/ -n

Repository: jumpstarter-dev/jumpstarter

Length of output: 673


🏁 Script executed:

# Also search in proto files which might have gRPC method definitions
fd -e proto python/packages/jumpstarter-cli | head -20

Repository: jumpstarter-dev/jumpstarter

Length of output: 595


🏁 Script executed:

# Let's look at the context around lines 298-303 and the retry wrapper at 436-498
sed -n '280,320p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 2548


🏁 Script executed:

# Look at the retry wrapper implementation mentioned in the review
sed -n '436,500p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 2967


🏁 Script executed:

# Find where the client object comes from and what type it is
sed -n '267,300p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 1951


🏁 Script executed:

# Search for ExporterUnavailableError definition and other uses
rg 'ExporterUnavailableError' python/packages/jumpstarter-cli/ -n

Repository: jumpstarter-dev/jumpstarter

Length of output: 1544


🏁 Script executed:

# Look for proto files or gRPC client definitions
fd . python/packages/jumpstarter-cli -name "*.proto" -o -name "*pb2*" -o -name "*_pb2*"

Repository: jumpstarter-dev/jumpstarter

Length of output: 396


🏁 Script executed:

# Find where client_from_path is defined to understand the client type
rg 'def client_from_path' python/packages/jumpstarter-cli/ -n -A 5

Repository: jumpstarter-dev/jumpstarter

Length of output: 165


🏁 Script executed:

# Search for get_status_async definition across the entire codebase
rg 'get_status_async' python/ -n -A 3 | head -100

Repository: jumpstarter-dev/jumpstarter

Length of output: 7345


🏁 Script executed:

# Look for gRPC stub definitions or client implementations
fd --full-path 'python/packages/jumpstarter-cli' -name '*.py' -type f | xargs grep -l 'get_status' | head -10

Repository: jumpstarter-dev/jumpstarter

Length of output: 396


🏁 Script executed:

# Check imports in shell.py to understand where exceptions come from
sed -n '1,50p' python/packages/jumpstarter-cli/jumpstarter_cli/shell.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 1789


🏁 Script executed:

# Let's read the actual implementation of get_status_async() to see what exceptions it can raise
sed -n '103,150p' python/packages/jumpstarter/jumpstarter/client/core.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 2156


🏁 Script executed:

# Check the error handling in get_status_async more closely
sed -n '103,180p' python/packages/jumpstarter/jumpstarter/client/core.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 3470


🏁 Script executed:

# Look for RPC error handling patterns in the client
rg 'RpcError|StatusCode' python/packages/jumpstarter/jumpstarter/client/core.py -n -A 2 -B 2

Repository: jumpstarter-dev/jumpstarter

Length of output: 4965


🏁 Script executed:

# Check what exception types are used in the common exceptions module
rg 'class ExporterUnavailableError|class ConnectionError' python/packages/jumpstarter/jumpstarter/common/ -n -A 5

Repository: jumpstarter-dev/jumpstarter

Length of output: 1268


🏁 Script executed:

# Check if there are any other exceptions that could be raised from get_status_async or its dependencies
rg 'raise.*Error' python/packages/jumpstarter/jumpstarter/client/core.py -n -B 2 | head -50

Repository: jumpstarter-dev/jumpstarter

Length of output: 1396


🏁 Script executed:

# Look for all exception types that could be raised in the status monitoring context
rg 'class.*Error' python/packages/jumpstarter/jumpstarter/common/exceptions.py -n

Repository: jumpstarter-dev/jumpstarter

Length of output: 651


🏁 Script executed:

# Check the tests to understand what exceptions are expected from get_status_async
sed -n '130,165p' python/packages/jumpstarter/jumpstarter/client/core_test.py

Repository: jumpstarter-dev/jumpstarter

Length of output: 1506


Broad except Exception converts both transient and non-transient RPC errors to retryable failures.

The specific concern about unsupported exporters is unfounded—get_status_async() returns None for UNIMPLEMENTED RPC errors, not an exception. However, a real issue exists with other RPC errors: when get_status_async() encounters non-transient gRPC failures (e.g., UNKNOWN, INVALID_ARGUMENT), it raises DriverError. Converting all exceptions to ExporterUnavailableError causes these non-transient errors to be retried for the full connect_retry_timeout (default 120s) before failing with "did not become available", masking the real error.

Narrow the catch block to only transient connection-level failures (e.g., specific RPC error codes or connection/unavailability errors) and let other exceptions propagate unchanged.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py` around lines 298 -
303, The broad except Exception clause in the try block around
client.get_status_async() catches all errors including non-transient RPC
failures (like UNKNOWN or INVALID_ARGUMENT), converting them to
ExporterUnavailableError which causes unnecessary retries that mask the real
error. Instead of catching all exceptions, narrow the except clause to only
catch transient connection-level failures such as specific RPC error codes
(e.g., UNAVAILABLE, DEADLINE_EXCEEDED) or connection-related exceptions, and
allow other exceptions like DriverError from non-transient failures to propagate
unchanged.

Comment on lines +399 to +405
try:
async with connect_router_stream(
response.router_endpoint, response.router_token, stream, self.tls_config, self.grpc_options
):
pass
except Exception as e:
raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not classify every router-stream exception as retryable exporter unavailability.

This catch-all wraps unrelated failures (and lifecycle/cancellation errors) into ExporterUnavailableError, which can trigger incorrect outer retries and hide root causes. Restrict wrapping to connection/transport failures and re-raise everything else.

Suggested fix
-        except Exception as e:
-            raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
+        except (AioRpcError, OSError) as e:
+            raise ExporterUnavailableError(f"Router connection failed for lease {self.name}") from e
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/packages/jumpstarter/jumpstarter/client/lease.py` around lines 399 -
405, The try-except block in the connect_router_stream context manager is
catching all exceptions and wrapping them into ExporterUnavailableError, which
masks unrelated failures and cancellation errors. Replace the broad Exception
catch with specific exception types that represent connection and transport
failures, and re-raise any other exceptions that are not related to router
connectivity issues. This allows lifecycle/cancellation errors and other root
causes to propagate correctly without being incorrectly classified as retryable
exporter unavailability.

@evakhoni

Copy link
Copy Markdown
Member Author

wait. this have a regression with broken retry mid-session.

@mangelajo mangelajo added this to the 0.9.0 milestone Jun 24, 2026
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch 3 times, most recently from 5e5e92e to 19e3d71 Compare June 24, 2026 21:51
evakhoni and others added 2 commits June 25, 2026 00:56
When `jmp shell` connects to an exporter that was killed or suspended,
GetReport fails with UNAVAILABLE. Convert this to ExporterUnreachableError
in client_from_path so the retry loop in _shell_with_signal_handling can
release the lease and re-acquire, eventually reaching a live exporter.

Add connect_timeout (default 5m) to bound the retry window for both
auto-created and pre-created leases. Also fix 2**attempt overflow in
the Dial retry loop by capping the exponent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- client_test: verify client_from_path converts UNAVAILABLE and
  DEADLINE_EXCEEDED to ExporterUnreachableError, other codes propagate
- shell_test: verify retry loop times out with connect_timeout, and
  succeeds when exporter recovers before timeout
- lease_test: fix MockAioRpcError attrs for pytest display, cap
  2**attempt in retry tests to prevent overflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@evakhoni evakhoni force-pushed the fix/handle-dead-exporter branch from 19e3d71 to f3051a8 Compare June 24, 2026 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants