CI infrastructure improvements: heartbeat, dump checks, timeouts, logging #15950
radical merged 7 commits into microsoft:main
The 5-second interval generates excessive log output during CI test runs. A 60-second default reduces noise while still providing periodic status updates for diagnosing runner hangs and disk space issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Linux test steps were relying on the default interval. Now that the default changed to 60s this is technically a no-op, but passing it explicitly makes the intent clear and matches the Windows steps which already specified 60s. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The dump file check was in specialized-test-runner.yml and checked artifacts/all-logs after downloading from sub-jobs. Move it to run-tests.yml where it checks testresults/ directly in the same job, matching the actual --crashdump/--hangdump output directory. This catches crashes/timeouts earlier and in the correct job context. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add testSessionTimeout and testHangTimeout MSBuild properties to the JSON emitted by SpecializedTestRunsheetBuilderBase.targets, allowing per-project timeout overrides to flow through the runsheet to the CI workflow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
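As a rough illustration of how a workflow step might consume these runsheet fields, the sketch below prefers a per-project override and otherwise falls back to a default. The function name `resolve_timeout`, the environment variable names, and the default values `20m` and `10m` are all hypothetical; the PR text does not show how the workflow actually reads the runsheet.

```shell
# Hypothetical helper: prefer the per-project value taken from the runsheet entry,
# otherwise fall back to a workflow-wide default.
resolve_timeout() {
  local override="$1" default="$2"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "$default"
  fi
}

# Example values only; the real defaults live in the workflow and are not shown here.
session_timeout="$(resolve_timeout "${TEST_SESSION_TIMEOUT:-}" "20m")"
hang_timeout="$(resolve_timeout "${TEST_HANG_TIMEOUT:-}" "10m")"
echo "session timeout: $session_timeout, hang timeout: $hang_timeout"
```

Keeping the fallback in one place means a project that sets neither property behaves exactly as it did before the fields were added to the runsheet.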
The old message 'filtering to single test project' was ambiguous. Make it clear this is a sanity-check behavior for pull_request events. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15950
Or
iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15950"
Pull request overview
This PR refines Aspire’s CI test infrastructure to reduce log noise and improve correctness of failure detection and timeout propagation across workflows.
Changes:
- Adjust heartbeat monitoring cadence (default + explicit CI invocation) to reduce noise/overhead.
- Move dump-file detection into the actual per-test execution job (run-tests.yml) instead of the specialized runner’s final aggregation job.
- Ensure specialized-test runsheets include testSessionTimeout/testHangTimeout so per-project overrides can flow into CI.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tools/scripts/Heartbeat.cs | Changes the script’s default heartbeat interval from 5s to 60s. |
| eng/SpecializedTestRunsheetBuilderBase.targets | Adds timeout fields to specialized test runsheet JSON entries. |
| eng/AfterSolutionBuild.targets | Clarifies the pull_request specialized-test filtering log message. |
| .github/workflows/specialized-test-runner.yml | Removes dump-file checking from the final results aggregation job. |
| .github/workflows/run-tests.yml | Passes explicit 60s heartbeat interval and adds a per-job dump-file check in the correct testresults/ context. |
Addresses review feedback: reject zero or negative interval values that would cause a tight loop or crash in Task.Delay. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
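The actual validation lives in Heartbeat.cs (C#) and is not reproduced in this conversation. The shell sketch below only illustrates the rule described in the commit message: accept whole-number intervals greater than zero and reject everything else. The function name `validate_interval` is hypothetical.

```shell
# Hypothetical helper: accept only whole-number intervals greater than zero.
validate_interval() {
  local value="$1"
  case "$value" in
    # Reject empty input and anything containing a non-digit (including a minus sign).
    ''|*[!0-9]*) return 1 ;;
  esac
  # Reject zero, which would otherwise produce a tight loop.
  [ "$value" -gt 0 ]
}

if validate_interval "60"; then
  echo "interval accepted"
fi
```

Rejecting bad input before the loop starts turns a confusing runtime failure into an immediate, readable error.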
Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
|
Crash dumps (e.g. *_crash.dmp) can be produced during process cleanup even when all tests passed — this is benign. Only fail on hang dump files (*hangdump*) which indicate the test runner timed out and tests may not have completed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
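A minimal sketch of that check, assuming the dumps land under testresults/ and that hang dumps can be recognized by the `*hangdump*` name pattern mentioned above. The function name `check_hang_dumps` is hypothetical and this is not the step from run-tests.yml itself.

```shell
# Sketch: fail only when hang dumps are present; crash dumps are tolerated.
check_hang_dumps() {
  local results_dir="${1:-testresults}"

  # Nothing to check if the results directory was never created.
  [ -d "$results_dir" ] || return 0

  local hang_dumps
  hang_dumps="$(find "$results_dir" -type f -iname '*hangdump*')"
  if [ -n "$hang_dumps" ]; then
    echo "Hang dump files found; tests may not have completed:" >&2
    echo "$hang_dumps" >&2
    return 1
  fi
  return 0
}
```

A workflow step could call `check_hang_dumps testresults` and let the non-zero return code fail the job, while a directory containing only files such as app_crash.dmp would pass.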
Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
JamesNK left a comment
Clean, low-risk CI infrastructure improvements. The dump check relocation correctly moves detection into the job context where testresults/ exists, and filtering to *hangdump* avoids false positives from crash dumps. Runsheet timeout fields are consistent with the existing TestEnumerationRunsheetBuilder pattern. LGTM.
Description
A collection of CI infrastructure improvements for test workflows:
- Change the Heartbeat.cs default interval from 5s to 60s, and pass 60s explicitly in the CI workflow for clarity.
- Move the dump file check from the specialized-test-runner.yml final results step into run-tests.yml, where it runs in the correct job context against the actual testresults/ directory.
- Emit testSessionTimeout and testHangTimeout in SpecializedTestRunsheetBuilderBase.targets so per-project timeout overrides flow through the runsheet to CI.