
CI infrastructure improvements: heartbeat, dump checks, timeouts, logging #15950

Merged
radical merged 7 commits into microsoft:main from radical:copilot/ci-infra-improvements
Apr 8, 2026



@radical
Member

@radical radical commented Apr 7, 2026

Description

A collection of CI infrastructure improvements for test workflows:

  • Reduce heartbeat log noise: Change Heartbeat.cs default interval from 5s to 60s, and pass 60s explicitly in the CI workflow for clarity.
  • Move dump file check to individual test jobs: Relocate crash/hang dump detection from the specialized-test-runner.yml final results step into run-tests.yml where it runs in the correct job context against the actual testresults/ directory.
  • Include timeout values in runsheets: Emit testSessionTimeout and testHangTimeout in SpecializedTestRunsheetBuilderBase.targets so per-project timeout overrides flow through the runsheet to CI.
  • Clarify PR filtering log message: Make the specialized test workflow log message unambiguous about pull_request sanity-check behavior.

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

radical and others added 5 commits April 7, 2026 19:58
The 5-second interval generates excessive log output during CI test runs.
A 60-second default reduces noise while still providing periodic status
updates for diagnosing runner hangs and disk space issues.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Linux test steps were relying on the default interval. Now that the
default changed to 60s this is technically a no-op, but passing it
explicitly makes the intent clear and matches the Windows steps which
already specified 60s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dump file check was in specialized-test-runner.yml and checked
artifacts/all-logs after downloading from sub-jobs. Move it to
run-tests.yml where it checks testresults/ directly in the same job,
matching the actual --crashdump/--hangdump output directory.

This catches crashes/timeouts earlier and in the correct job context.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testSessionTimeout and testHangTimeout MSBuild properties to the
JSON emitted by SpecializedTestRunsheetBuilderBase.targets, allowing
per-project timeout overrides to flow through the runsheet to the
CI workflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
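
As a rough illustration of how timeout values from a runsheet entry might reach the test command line, here is a minimal shell sketch. The function name `build_timeout_args` and the fallback values `20m` and `10m` are hypothetical, and the mapping of `testSessionTimeout` to `--timeout` and `testHangTimeout` to `--hangdump-timeout` is an assumption based on the Microsoft.Testing.Platform options, not something this PR shows.

```shell
#!/usr/bin/env bash
# Sketch only: build_timeout_args and its default values are hypothetical.
# Assumption: testSessionTimeout maps to --timeout and testHangTimeout
# maps to --hangdump-timeout on the test runner command line.

# Print the timeout-related arguments for one test project.
#   $1: session timeout from the runsheet (optional)
#   $2: hang timeout from the runsheet (optional)
build_timeout_args() {
  local session_timeout="${1:-20m}"
  local hang_timeout="${2:-10m}"
  echo "--timeout $session_timeout --hangdump --hangdump-timeout $hang_timeout"
}
```

A project that overrides both values would call `build_timeout_args 45m 15m`, while a project with no overrides falls back to the defaults.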
The old message 'filtering to single test project' was ambiguous.
Make it clear this is a sanity-check behavior for pull_request events.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 23:59
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15950

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15950"

Contributor

Copilot AI left a comment


Pull request overview

This PR refines Aspire’s CI test infrastructure to reduce log noise and improve correctness of failure detection and timeout propagation across workflows.

Changes:

  • Adjust heartbeat monitoring cadence (default + explicit CI invocation) to reduce noise/overhead.
  • Move dump-file detection into the actual per-test execution job (run-tests.yml) instead of the specialized runner’s final aggregation job.
  • Ensure specialized-test runsheets include testSessionTimeout / testHangTimeout so per-project overrides can flow into CI.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/scripts/Heartbeat.cs | Changes the script’s default heartbeat interval from 5s to 60s. |
| eng/SpecializedTestRunsheetBuilderBase.targets | Adds timeout fields to specialized test runsheet JSON entries. |
| eng/AfterSolutionBuild.targets | Clarifies the pull_request specialized-test filtering log message. |
| .github/workflows/specialized-test-runner.yml | Removes dump-file checking from the final results aggregation job. |
| .github/workflows/run-tests.yml | Passes explicit 60s heartbeat interval and adds a per-job dump-file check in the correct testresults/ context. |

Addresses review feedback: reject zero or negative interval values
that would cause a tight loop or crash in Task.Delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
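
The actual validation lives in Heartbeat.cs (C#), which is not shown on this page. The same idea can be sketched in shell: accept only a positive whole number of seconds and fall back to the 60s default when nothing is passed. The function names `is_valid_interval` and `resolve_interval` and the error text are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: a shell analogue of the interval check described above,
# not the code in Heartbeat.cs. Function names are hypothetical.

# Return 0 if the argument is a whole number of seconds greater than zero.
is_valid_interval() {
  case "${1:-}" in
    ''|*[!0-9]*) return 1 ;;  # empty, negative (leading '-'), or non-numeric
  esac
  [ "$1" -gt 0 ]
}

# Print the interval to use: the given value, or 60 when none is given.
# Returns 1 and writes to stderr if the value is zero, negative, or not a number.
resolve_interval() {
  local interval="${1:-60}"
  if ! is_valid_interval "$interval"; then
    echo "Invalid interval '$interval': must be a positive number of seconds" >&2
    return 1
  fi
  echo "$interval"
}
```

Rejecting the value up front means a caller never reaches a sleep or delay call with zero or a negative number, which is the failure mode the commit message describes.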
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

Crash dumps (e.g. *_crash.dmp) can be produced during process cleanup
even when all tests passed — this is benign. Only fail on hang dump
files (*hangdump*) which indicate the test runner timed out and tests
may not have completed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
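
A per-job check along these lines could look like the following shell sketch. The `testresults/` directory and the `*hangdump*` pattern come from the commit messages above. The function name `check_hang_dumps`, the messages, and the handling of a missing directory are assumptions, not the actual step in run-tests.yml.

```shell
#!/usr/bin/env bash
# Sketch only: check_hang_dumps is a hypothetical helper, not the real workflow step.

# Print any hang dump files under the given directory.
# Returns 1 if at least one is found, 0 otherwise.
check_hang_dumps() {
  local results_dir="${1:-testresults}"

  # Assumption: a missing results directory means there is nothing to report.
  if [ ! -d "$results_dir" ]; then
    return 0
  fi

  local dumps
  dumps="$(find "$results_dir" -type f -name '*hangdump*')"
  if [ -n "$dumps" ]; then
    echo "Hang dump files found; the test run likely timed out:"
    echo "$dumps"
    return 1
  fi

  # Crash dumps such as *_crash.dmp are deliberately ignored here.
  return 0
}
```

In a workflow step this would be called as `check_hang_dumps testresults || exit 1`, so a hang dump fails the job while a crash dump left behind during process cleanup does not.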
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as a retry-safe transient failure in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

@radical radical enabled auto-merge (squash) April 8, 2026 03:30
Member

@JamesNK JamesNK left a comment


Clean, low-risk CI infrastructure improvements. The dump check relocation correctly moves detection into the job context where testresults/ exists, and filtering to *hangdump* avoids false positives from crash dumps. Runsheet timeout fields are consistent with the existing TestEnumerationRunsheetBuilder pattern. LGTM.

@radical radical merged commit 4c6f8c7 into microsoft:main Apr 8, 2026
535 of 540 checks passed


3 participants