Skip to content

fix(hooks): eliminate macOS PTY output race condition#837

Open
raballew wants to merge 11 commits into
jumpstarter-dev:mainfrom
raballew:fix/pty-read-before-stop
Open

fix(hooks): eliminate macOS PTY output race condition#837
raballew wants to merge 11 commits into
jumpstarter-dev:mainfrom
raballew:fix/pty-read-before-stop

Conversation

@raballew

Copy link
Copy Markdown
Member

Summary

Fixes the root cause of flaky PTY output tests on macOS (#821, previously addressed as symptoms in #560, #733, #826).

The bug has two compounding races:

  1. Event-loop scheduling starvation: The reader task checks while not pty_state.reader_stop: at loop entry. For fast-completing subprocesses, wait_for_process can set reader_stop = True before the reader ever gets scheduled, causing it to skip all reads entirely.

  2. macOS PTY revocation: start_new_session=True makes the subprocess a session leader. On macOS (BSD), when a session leader exits, revoke(2) is called on the terminal device, which can discard buffered data -- making the drain (the only remaining capture path after Race 1) unreliable.

Changes

  • start_new_session=True -> process_group=0: Creates a new process group without a new session. No session leader means no PTY revocation on macOS. The code never uses session-based operations (os.killpg, session signals) -- only process.terminate()/process.kill() -- so nothing else changes.

  • Reader loop restructured to read-before-stop: Changed from while not reader_stop (can skip the entire loop) to while True with reader_stop checked only inside the except BlockingIOError handler. The reader always attempts os.read() before considering the stop flag.

  • Removed DRAIN_MAX_EMPTY_POLLS retry logic: With the root causes fixed, the drain retry mechanism from fix: retry PTY drain on empty select() to prevent data loss on macOS #826 is unnecessary. Reverted the drain to a simple single-pass safety net.

Test plan

  • All 53 hooks tests pass locally on Linux (Python 3.12)
  • Linter passes (ruff check)
  • macOS CI passes consistently (the previously flaky tests)

Closes #821

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updates PTY hook execution and drain handling, adjusts the PTY tests to match the new behavior, and expands the pull request Python test matrix.

Changes

PTY Drain Logic Simplification

Layer / File(s) Summary
PTY reader and drain behavior
python/packages/jumpstarter/jumpstarter/exporter/hooks.py
Changes subprocess spawn to process_group=0, refactors the PTY reader loop to handle reader_stop during non-blocking reads, and simplifies final drain termination to use select() timeout with byte/time bounds.

PTY Drain Test Updates

Layer / File(s) Summary
Drain tests and PTY mock behavior
python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
Removes the DRAIN_MAX_EMPTY_POLLS import, updates PTY mock EOF handling, and replaces drain retry tests with a stop-flag test and constant checks.

Python Test Matrix Update

Layer / File(s) Summary
PR matrix outputs
.github/workflows/python-tests.yaml
The pull_request matrix now outputs Python 3.11, 3.12, and 3.13 on ubuntu-24.04 and macos-15.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • bennyz
  • kirkbrauer
  • evakhoni

Poem

🐇 I hopped through PTYs, light and spry,
With bytes and time to catch each cry.
New tests now tumble, clear and neat,
While runners stretch across the fleet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the macOS PTY race fix in hooks.
Description check ✅ Passed The description clearly matches the PTY race fix and test/workflow changes in the PR.
Linked Issues check ✅ Passed The changes address the reported macOS PTY race by fixing reader timing, avoiding session-leader revocation, and updating tests/CI.
Out of Scope Changes check ✅ Passed The workflow and test updates are directly tied to validating the macOS PTY fix, with no obvious unrelated changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/python-tests.yaml:
- Around line 44-45: Update the stale matrix comments in the workflow so they
match the current behavior. The comments near the matrix setup and downstream
job reference only Python 3.12 on Linux, but the actual matrix in the workflow
now includes Python 3.11, 3.12, and 3.13 on both ubuntu-24.04 and macos-15.
Adjust the comments in the relevant workflow sections to accurately describe the
expanded PR test coverage, using the matrix/output symbols already present in
the job definition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c151c11a-1d80-4cf7-920c-31f30c29e97b

📥 Commits

Reviewing files that changed from the base of the PR and between 90695e1 and c2ed2a6.

📒 Files selected for processing (1)
  • .github/workflows/python-tests.yaml

Comment thread .github/workflows/python-tests.yaml Outdated
Comment thread python/packages/jumpstarter/jumpstarter/exporter/hooks.py
Comment thread python/packages/jumpstarter/jumpstarter/exporter/hooks.py
Comment thread .github/workflows/python-tests.yaml Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/python-tests.yaml:
- Around line 56-60: The matrix job artifact upload is using the same name for
every run, which causes conflicts when multiple `run-index` executions upload in
parallel. Update the workflow job that uses the `matrix` entries for
`python-version` and `run-index` so the `actions/upload-artifact` name includes
`run-index` (and any other matrix identifier needed) to make each artifact name
unique. Use the existing `run-index` matrix symbol to parameterize the artifact
name in the upload step.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0c1a6b71-1179-478b-942a-4db5a8d36733

📥 Commits

Reviewing files that changed from the base of the PR and between 3a84cd7 and 2b33f9a.

📒 Files selected for processing (2)
  • .github/workflows/python-tests.yaml
  • python/packages/jumpstarter/jumpstarter/exporter/hooks.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/packages/jumpstarter/jumpstarter/exporter/hooks.py

Comment on lines +56 to +60
fail-fast: false
matrix:
runs-on: ${{ fromJson(needs.changes.outputs.runners) }}
# Floor: oldest Python in supported platforms (RHEL 9 appstream)
# Ceiling: newest Python in latest Fedora
# Review on each RHEL/Fedora release
# PRs run only 3.12 on Linux; merge queue runs all versions
# on both Linux and macOS.
python-version: ${{ fromJson(needs.changes.outputs.python-versions) }}
run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '1,220p' .github/workflows/python-tests.yaml

Repository: jumpstarter-dev/jumpstarter

Length of output: 7886


🌐 Web query:

actions/upload-artifact repeated artifact name same workflow run matrix jobs behavior v7

💡 Result:

In GitHub Actions, using the same artifact name across multiple jobs in a matrix (or within the same workflow run) is not supported in actions/upload-artifact version 4 and later (including v7) [1][2][3]. Key points regarding this behavior: - Immutability and Conflict Errors: Unlike earlier versions (e.g., v3), which allowed multiple uploads to overwrite or merge into a single artifact, v4+ artifacts are immutable [1][2][3]. Attempting to upload multiple artifacts with the same name during a single workflow run will trigger a conflict error (HTTP 409) [4][2][3]. - Recommended Solution: You must ensure that each artifact name is unique within the workflow run [1][2][3]. A common practice is to suffix or prefix the artifact name with matrix variables, such as name: artifact-${{ matrix.os }} [1][4][5]. - Downloading Merged Artifacts: To reconstruct a single collection of files from these uniquely named artifacts, use actions/download-artifact with the pattern and merge-multiple inputs [1][4][6]. Setting merge-multiple: true allows you to download all artifacts matching a specific pattern (e.g., pattern: artifact-*) into a single directory [4][6]. This change was introduced to prevent accidental data corruption and undefined behavior caused by concurrent, conflicting writes to the same artifact storage [1][2][3].

Citations:


Include run-index in the artifact name. Each matrix job currently uploads to the same actions/upload-artifact name, which conflicts in v4/v7 and fails the later uploads.

Suggested fix
       - name: Upload test logs
         if: always()
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
         with:
-          name: test-logs-${{ matrix.runs-on }}-py${{ matrix.python-version }}
+          name: test-logs-${{ matrix.runs-on }}-py${{ matrix.python-version }}-run${{ matrix.run-index }}
           path: ${{ runner.temp }}/test-logs/
           if-no-files-found: ignore
           retention-days: 7
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/python-tests.yaml around lines 56 - 60, The matrix job
artifact upload is using the same name for every run, which causes conflicts
when multiple `run-index` executions upload in parallel. Update the workflow job
that uses the `matrix` entries for `python-version` and `run-index` so the
`actions/upload-artifact` name includes `run-index` (and any other matrix
identifier needed) to make each artifact name unique. Use the existing
`run-index` matrix symbol to parameterize the artifact name in the upload step.

Comment on lines +44 to +45
echo 'python-versions=["3.11", "3.12", "3.13"]' >> "$GITHUB_OUTPUT"
echo 'runners=["macos-15"]' >> "$GITHUB_OUTPUT"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo 'python-versions=["3.11", "3.12", "3.13"]' >> "$GITHUB_OUTPUT"
echo 'runners=["macos-15"]' >> "$GITHUB_OUTPUT"
echo 'python-versions=["3.12"]' >> "$GITHUB_OUTPUT"
echo 'runners=["ubuntu-24.04"]' >> "$GITHUB_OUTPUT"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just put this back

if: needs.changes.outputs.should_run == 'true' || github.event_name == 'workflow_dispatch'
runs-on: ${{ matrix.runs-on }}
strategy:
fail-fast: false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep fail fast as this... it's very annoying to not get to know if the other variants work or not when one fails.

# PRs run only 3.12 on Linux; merge queue runs all versions
# on both Linux and macOS.
python-version: ${{ fromJson(needs.changes.outputs.python-versions) }}
run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
run-index: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Comment on lines +518 to +519
async def test_reader_grace_constant_is_reasonable(self) -> None:
assert READER_GRACE_SECONDS == 2.0

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async def test_reader_grace_constant_is_reasonable(self) -> None:
assert READER_GRACE_SECONDS == 2.0

@mangelajo mangelajo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, much simpler. let's just remove the remains of forced retesting

@raballew raballew force-pushed the fix/pty-read-before-stop branch from ca2530d to 2b33f9a Compare June 24, 2026 16:43

@mangelajo mangelajo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good only those changes needed

@mangelajo mangelajo added this to the 0.9.0 milestone Jun 24, 2026
@mangelajo

mangelajo commented Jun 25, 2026

Copy link
Copy Markdown
Member

oh no, it's really introducing regressions in E2E/linux for the hooks

https://github.com/jumpstarter-dev/jumpstarter/actions/runs/28114462001/job/83374717893?pr=837#step:10:1194

Hooks E2E Tests Group H: PR Regression Tests H2: beforeLease fail+exit does NOT run afterLease hook [hooks]
/home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410
  > Enter [It] H2: beforeLease fail+exit does NOT run afterLease hook @ 06/25/26 07:33:35.26
  Started exporter test-exporter-hooks (PID 23141)
  [TIMEDOUT] in [It] - /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410 @ 06/25/26 07:37:06.917
  < Exit [It] H2: beforeLease fail+exit does NOT run afterLease hook @ 06/25/26 07:37:06.918 (3m31.658s)
  > Enter [AfterEach] Hooks E2E Tests @ 06/25/26 07:37:06.918
  < Exit [AfterEach] Hooks E2E Tests @ 06/25/26 07:37:06.918 (0s)
  > Enter [AfterAll] Hooks E2E Tests @ 06/25/26 07:37:06.918
  < Exit [AfterAll] Hooks E2E Tests @ 06/25/26 07:37:08.649 (1.732s)
• [TIMEDOUT] [213.389 seconds]
Hooks E2E Tests Group H: PR Regression Tests [It] H2: beforeLease fail+exit does NOT run afterLease hook [hooks]
/home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410

  [TIMEDOUT] A suite timeout occurred
  In [It] at: /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410 @ 06/25/26 07:37:06.917

  Full Stack Trace

  This is the Progress Report generated when the suite timeout occurred:
    Hooks E2E Tests Group H: PR Regression Tests H2: beforeLease fail+exit does NOT run afterLease hook (Spec Runtime: 3m31.657s)
      /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410
      In [It] (Node Runtime: 3m31.657s)
        /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410

      Spec Goroutine
      goroutine 833 [syscall, 3 minutes]
        syscall.Syscall6(0xf7, 0x3, 0xc, 0x161486d7fbb0, 0x4, 0x161486dc8240, 0x0)
          /opt/hostedtoolcache/go/1.26.3/x64/src/syscall/syscall_linux.go:96
        internal/syscall/unix.Waitid(0x161486d7fd00?, 0x56976d?, 0x161486dee340?, 0x1?, 0x161486b74510?)
          /opt/hostedtoolcache/go/1.26.3/x64/src/internal/syscall/unix/waitid_linux.go:18
        os.(*Process).pidfdWait.func1(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/pidfd_linux.go:108
        os.ignoringEINTR(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/file_posix.go:256
        os.(*Process).pidfdWait(0x161486d9c400)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/pidfd_linux.go:107
        os.(*Process).wait(0x569a40?)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec_unix.go:25
        os.(*Process).Wait(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec.go:346
        os/exec.(*Cmd).Wait(0x161486c28000)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec/exec.go:930
        os/exec.(*Cmd).Run(0x161486c28000)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec/exec.go:632
      > github.com/jumpstarter-dev/jumpstarter/e2e/test.RunCmd({0xa5bf5a?, 0xa81ccf?}, {0x161486d7fe08?, 0x161486aa4520?, 0x161486d21450?})
          /home/runner/work/jumpstarter/jumpstarter/e2e/test/utils.go:156
            | 	cmd.Stdout = &out
            | 	cmd.Stderr = &out
            > 	err := cmd.Run()
            | 	return strings.TrimSpace(out.String()), err
            | }
      > github.com/jumpstarter-dev/jumpstarter/e2e/test.Jmp({0x161486d7fe08, 0x8, 0x8})
          /home/runner/work/jumpstarter/jumpstarter/e2e/test/utils.go:213
            | // Jmp runs a jmp CLI command and returns the output.
            | func Jmp(args ...string) (string, error) {
            > 	return RunCmd(JmpPath(), args...)
            | }
            | 
      > github.com/jumpstarter-dev/jumpstarter/e2e/test.init.func6.14.2()
          /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:413
            | startHooksExporterSingle("exporter-hooks-before-fail-exit-with-after.yaml")
            | 
            > out, err := Jmp("shell", "--client", "test-client-hooks",
            | 	"--selector", "example.com/board=hooks", "j", "power", "on")
            | Expect(err).To(HaveOccurred())
        github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x161486b11f68?, 0x0?})
          /home/runner/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.28.1/internal/node.go:585
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
          /home/runner/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.28.1/internal/suite.go:946
        github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 7
          /home/runner/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.28.1/internal/suite.go:911

      Goroutines of Interest
      goroutine 853 [syscall, 3 minutes]
        syscall.Syscall6(0xf7, 0x3, 0xb, 0x161486bb3e68, 0x4, 0x161486dc8000, 0x0)
          /opt/hostedtoolcache/go/1.26.3/x64/src/syscall/syscall_linux.go:96
        internal/syscall/unix.Waitid(0x0?, 0x161486bb3e98?, 0x46c3d7?, 0x0?, 0xea4a30?)
          /opt/hostedtoolcache/go/1.26.3/x64/src/internal/syscall/unix/waitid_linux.go:18
        os.(*Process).pidfdWait.func1(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/pidfd_linux.go:108
        os.ignoringEINTR(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/file_posix.go:256
        os.(*Process).pidfdWait(0x161486d9c140)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/pidfd_linux.go:107
        os.(*Process).wait(0x0?)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec_unix.go:25
        os.(*Process).Wait(...)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec.go:346
        os/exec.(*Cmd).Wait(0x161486d37a00)
          /opt/hostedtoolcache/go/1.26.3/x64/src/os/exec/exec.go:930
      > github.com/jumpstarter-dev/jumpstarter/e2e/test.(*ProcessTracker).StartExporterSingle.func1()
          /home/runner/work/jumpstarter/jumpstarter/e2e/test/utils.go:422
            | // Reap the child process in the background so it doesn't become a zombie.
            | go func() {
            > 	_ = cmd.Wait()
            | }()
            | 
      > github.com/jumpstarter-dev/jumpstarter/e2e/test.(*ProcessTracker).StartExporterSingle in goroutine 833
          /home/runner/work/jumpstarter/jumpstarter/e2e/test/utils.go:421
            | 
            | // Reap the child process in the background so it doesn't become a zombie.
            > go func() {
            | 	_ = cmd.Wait()
            | }()
------------------------------
SSSSSSSSSS

Summarizing 1 Failure:
  [TIMEDOUT] Hooks E2E Tests Group H: PR Regression Tests [It] H2: beforeLease fail+exit does NOT run afterLease hook [hooks]
  /home/runner/work/jumpstarter/jumpstarter/e2e/test/hooks_test.go:410

Ran 69 of 91 Specs in 1794.185 seconds
FAIL! - Suite Timeout Elapsed -- 68 Passed | 1 Failed | 0 Pending | 22 Skipped

@mangelajo mangelajo modified the milestones: 0.9.0, 0.10.0 Jun 25, 2026
raballew and others added 11 commits June 25, 2026 16:16
Replace start_new_session=True with process_group=0 to prevent macOS
PTY revocation on subprocess exit. Restructure the reader loop to
always attempt os.read() before checking the stop flag, preventing
event-loop scheduling starvation from skipping all reads.

These two changes address the root cause of the flaky macOS PTY tests
(jumpstarter-dev#560, jumpstarter-dev#733, jumpstarter-dev#821, jumpstarter-dev#826) rather than the symptoms. The drain retry
logic from jumpstarter-dev#826 is no longer needed and is removed.

Closes jumpstarter-dev#821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run the full test matrix (all Python versions, Linux + macOS) on PRs
to confirm the PTY race condition fix passes consistently on macOS.
This will be reverted once 20 consecutive macOS passes are confirmed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test for the grandchild scenario where the PTY slave is held open
after the direct child exits. The reader gets BlockingIOError (no
data) and exits via the stop flag. This covers lines 357-358 in
hooks.py to satisfy the diff-coverage threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous test mocked os.read to simulate BlockingIOError, but on
Linux the reader gets EOF before BlockingIOError. Use a real
backgrounded subprocess (sleep &) to hold the PTY slave open, which
forces the reader to exit via the reader_stop + BlockingIOError path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BlockingIOError means "no data yet" (transient), not EOF. Setting
eof_seen on BlockingIOError caused the mock to inject drain data
during the main reader loop on macOS, where BlockingIOError can occur
before the real EOF arrives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporarily run only macOS runners with 20 repetitions via matrix
index to confirm the PTY race condition fix is stable. Will be
reverted after validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… empty

On macOS, PTY internal buffer delivery can lag behind slave closure
for very fast commands. The drain must keep polling (bounded by
DRAIN_TIMEOUT_SECONDS) rather than breaking on the first empty
select(), giving the kernel time to deliver remaining data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the 0.2s grace period with DRAIN_TIMEOUT_SECONDS (2s) after
subprocess exit. This gives the macOS PTY kernel buffer sufficient
time to deliver data to the master fd before the reader_stop flag
forces the reader to exit.

The 0.2s window was the fundamental cause of the remaining failures:
on loaded macOS CI runners, the PTY buffer delivery can lag behind
slave closure by more than 200ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The PTY fix adds a 2s grace period after each hook subprocess exits,
giving the macOS PTY kernel buffer time to deliver data. With ~24
hook e2e tests running both beforeLease and afterLease hooks, this
adds ~2-3 minutes total. Bump the timeout to accommodate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the xfail markers added by upstream for the macOS PTY race
condition. The fix in this PR (process_group=0 + read-before-stop +
drain with continue + extended grace period) addresses the root cause,
making these markers unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raballew raballew force-pushed the fix/pty-read-before-stop branch from 6e9fa74 to e06ac52 Compare June 25, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_exec_bash in hooks_test.py on macOS still reproducing after previous fixes

2 participants