fix(hooks): eliminate macOS PTY output race condition#837
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughUpdates PTY hook execution and drain handling, adjusts the PTY tests to match the new behavior, and expands the pull request Python test matrix. ChangesPTY Drain Logic Simplification
PTY Drain Test Updates
Python Test Matrix Update
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/python-tests.yaml:
- Around line 44-45: Update the stale matrix comments in the workflow so they
match the current behavior. The comments near the matrix setup and downstream
job reference only Python 3.12 on Linux, but the actual matrix in the workflow
now includes Python 3.11, 3.12, and 3.13 on both ubuntu-24.04 and macos-15.
Adjust the comments in the relevant workflow sections to accurately describe the
expanded PR test coverage, using the matrix/output symbols already present in
the job definition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c151c11a-1d80-4cf7-920c-31f30c29e97b
📒 Files selected for processing (1)
.github/workflows/python-tests.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/python-tests.yaml:
- Around line 56-60: The matrix job artifact upload is using the same name for
every run, which causes conflicts when multiple `run-index` executions upload in
parallel. Update the workflow job that uses the `matrix` entries for
`python-version` and `run-index` so the `actions/upload-artifact` name includes
`run-index` (and any other matrix identifier needed) to make each artifact name
unique. Use the existing `run-index` matrix symbol to parameterize the artifact
name in the upload step.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0c1a6b71-1179-478b-942a-4db5a8d36733
📒 Files selected for processing (2)
.github/workflows/python-tests.yamlpython/packages/jumpstarter/jumpstarter/exporter/hooks.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/packages/jumpstarter/jumpstarter/exporter/hooks.py
| fail-fast: false | ||
| matrix: | ||
| runs-on: ${{ fromJson(needs.changes.outputs.runners) }} | ||
| # Floor: oldest Python in supported platforms (RHEL 9 appstream) | ||
| # Ceiling: newest Python in latest Fedora | ||
| # Review on each RHEL/Fedora release | ||
| # PRs run only 3.12 on Linux; merge queue runs all versions | ||
| # on both Linux and macOS. | ||
| python-version: ${{ fromJson(needs.changes.outputs.python-versions) }} | ||
| run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
sed -n '1,220p' .github/workflows/python-tests.yaml
Repository: jumpstarter-dev/jumpstarter
Length of output: 7886
🌐 Web query:
actions/upload-artifact repeated artifact name same workflow run matrix jobs behavior v7
💡 Result:
In GitHub Actions, using the same artifact name across multiple jobs in a matrix (or within the same workflow run) is not supported in actions/upload-artifact version 4 and later (including v7) [1][2][3]. Key points regarding this behavior: - Immutability and Conflict Errors: Unlike earlier versions (e.g., v3), which allowed multiple uploads to overwrite or merge into a single artifact, v4+ artifacts are immutable [1][2][3]. Attempting to upload multiple artifacts with the same name during a single workflow run will trigger a conflict error (HTTP 409) [4][2][3]. - Recommended Solution: You must ensure that each artifact name is unique within the workflow run [1][2][3]. A common practice is to suffix or prefix the artifact name with matrix variables, such as name: artifact-${{ matrix.os }} [1][4][5]. - Downloading Merged Artifacts: To reconstruct a single collection of files from these uniquely named artifacts, use actions/download-artifact with the pattern and merge-multiple inputs [1][4][6]. Setting merge-multiple: true allows you to download all artifacts matching a specific pattern (e.g., pattern: artifact-*) into a single directory [4][6]. This change was introduced to prevent accidental data corruption and undefined behavior caused by concurrent, conflicting writes to the same artifact storage [1][2][3].
Citations:
- 1: https://github.blog/news-insights/product-news/get-started-with-v4-of-github-actions-artifacts/
- 2: https://raw.githubusercontent.com/actions/upload-artifact/main/README.md
- 3: https://github.com/marketplace/actions/upload-a-build-artifact
- 4: [bug] (v4) Unable to upload to same artifact name from multiple jobs actions/upload-artifact#478
- 5: Uploading to the same named Artifact multiple times -> Why the breaking change? actions/upload-artifact#469
- 6: https://github.com/actions/upload-artifact/blob/main/docs/MIGRATION.md
Include run-index in the artifact name. Each matrix job currently uploads to the same actions/upload-artifact name, which conflicts in v4/v7 and fails the later uploads.
Suggested fix
- name: Upload test logs
if: always()
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
with:
- name: test-logs-${{ matrix.runs-on }}-py${{ matrix.python-version }}
+ name: test-logs-${{ matrix.runs-on }}-py${{ matrix.python-version }}-run${{ matrix.run-index }}
path: ${{ runner.temp }}/test-logs/
if-no-files-found: ignore
retention-days: 7
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/python-tests.yaml around lines 56 - 60, The matrix job
artifact upload is using the same name for every run, which causes conflicts
when multiple `run-index` executions upload in parallel. Update the workflow job
that uses the `matrix` entries for `python-version` and `run-index` so the
`actions/upload-artifact` name includes `run-index` (and any other matrix
identifier needed) to make each artifact name unique. Use the existing
`run-index` matrix symbol to parameterize the artifact name in the upload step.
| echo 'python-versions=["3.11", "3.12", "3.13"]' >> "$GITHUB_OUTPUT" | ||
| echo 'runners=["macos-15"]' >> "$GITHUB_OUTPUT" |
There was a problem hiding this comment.
| echo 'python-versions=["3.11", "3.12", "3.13"]' >> "$GITHUB_OUTPUT" | |
| echo 'runners=["macos-15"]' >> "$GITHUB_OUTPUT" | |
| echo 'python-versions=["3.12"]' >> "$GITHUB_OUTPUT" | |
| echo 'runners=["ubuntu-24.04"]' >> "$GITHUB_OUTPUT" |
| if: needs.changes.outputs.should_run == 'true' || github.event_name == 'workflow_dispatch' | ||
| runs-on: ${{ matrix.runs-on }} | ||
| strategy: | ||
| fail-fast: false |
There was a problem hiding this comment.
let's keep fail fast as this... it's very annoying to not get to know if the other variants work or not when one fails.
| # PRs run only 3.12 on Linux; merge queue runs all versions | ||
| # on both Linux and macOS. | ||
| python-version: ${{ fromJson(needs.changes.outputs.python-versions) }} | ||
| run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] |
There was a problem hiding this comment.
| run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] |
| async def test_reader_grace_constant_is_reasonable(self) -> None: | ||
| assert READER_GRACE_SECONDS == 2.0 |
There was a problem hiding this comment.
| async def test_reader_grace_constant_is_reasonable(self) -> None: | |
| assert READER_GRACE_SECONDS == 2.0 |
mangelajo
left a comment
There was a problem hiding this comment.
looks good, much simpler. let's just remove the remains of forced retesting
ca2530d to
2b33f9a
Compare
mangelajo
left a comment
There was a problem hiding this comment.
looks good only those changes needed
|
oh no, it's really introducing regressions in E2E/linux for the hooks |
Replace start_new_session=True with process_group=0 to prevent macOS PTY revocation on subprocess exit. Restructure the reader loop to always attempt os.read() before checking the stop flag, preventing event-loop scheduling starvation from skipping all reads. These two changes address the root cause of the flaky macOS PTY tests (jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry logic from jumpstarter-dev#826 is no longer needed and is removed. Closes jumpstarter-dev#821 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run the full test matrix (all Python versions, Linux + macOS) on PRs to confirm the PTY race condition fix passes consistently on macOS. This will be reverted once 20 consecutive macOS passes are confirmed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test for the grandchild scenario where the PTY slave is held open after the direct child exits. The reader gets BlockingIOError (no data) and exits via the stop flag. This covers lines 357-358 in hooks.py to satisfy the diff-coverage threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous test mocked os.read to simulate BlockingIOError, but on Linux the reader gets EOF before BlockingIOError. Use a real backgrounded subprocess (sleep &) to hold the PTY slave open, which forces the reader to exit via the reader_stop + BlockingIOError path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BlockingIOError means "no data yet" (transient), not EOF. Setting eof_seen on BlockingIOError caused the mock to inject drain data during the main reader loop on macOS, where BlockingIOError can occur before the real EOF arrives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporarily run only macOS runners with 20 repetitions via matrix index to confirm the PTY race condition fix is stable. Will be reverted after validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… empty On macOS, PTY internal buffer delivery can lag behind slave closure for very fast commands. The drain must keep polling (bounded by DRAIN_TIMEOUT_SECONDS) rather than breaking on the first empty select(), giving the kernel time to deliver remaining data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the 0.2s grace period with DRAIN_TIMEOUT_SECONDS (2s) after subprocess exit. This gives the macOS PTY kernel buffer sufficient time to deliver data to the master fd before the reader_stop flag forces the reader to exit. The 0.2s window was the fundamental cause of the remaining failures: on loaded macOS CI runners, the PTY buffer delivery can lag behind slave closure by more than 200ms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PTY fix adds a 2s grace period after each hook subprocess exits, giving the macOS PTY kernel buffer time to deliver data. With ~24 hook e2e tests running both beforeLease and afterLease hooks, this adds ~2-3 minutes total. Bump the timeout to accommodate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the xfail markers added by upstream for the macOS PTY race condition. The fix in this PR (process_group=0 + read-before-stop + drain with continue + extended grace period) addresses the root cause, making these markers unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6e9fa74 to
e06ac52
Compare
Summary
Fixes the root cause of flaky PTY output tests on macOS (#821, previously addressed as symptoms in #560, #733, #826).
The bug has two compounding races:
Event-loop scheduling starvation: The reader task checks
while not pty_state.reader_stop:at loop entry. For fast-completing subprocesses,wait_for_processcan setreader_stop = Truebefore the reader ever gets scheduled, causing it to skip all reads entirely.macOS PTY revocation:
start_new_session=Truemakes the subprocess a session leader. On macOS (BSD), when a session leader exits,revoke(2)is called on the terminal device, which can discard buffered data -- making the drain (the only remaining capture path after Race 1) unreliable.Changes
start_new_session=True->process_group=0: Creates a new process group without a new session. No session leader means no PTY revocation on macOS. The code never uses session-based operations (os.killpg, session signals) -- onlyprocess.terminate()/process.kill()-- so nothing else changes.Reader loop restructured to read-before-stop: Changed from
while not reader_stop(can skip the entire loop) towhile Truewithreader_stopchecked only inside theexcept BlockingIOErrorhandler. The reader always attemptsos.read()before considering the stop flag.Removed
DRAIN_MAX_EMPTY_POLLSretry logic: With the root causes fixed, the drain retry mechanism from fix: retry PTY drain on empty select() to prevent data loss on macOS #826 is unnecessary. Reverted the drain to a simple single-pass safety net.Test plan
ruff check)Closes #821
🤖 Generated with Claude Code