[hotfix] Fix flaky tests #28106
Open
Dennis-Mircea wants to merge 6 commits into apache:master
…ingWaitingForFetcherShutdown flaky test
…TerminatedRescaleMergingWithNewRecoverableFailureTriggerCause flaky test
…estCheckpointDrain flaky test
…stMultipleJobVertexFinishedTaskExceedRatio flaky test
…treamOperatorTest
davidradl
reviewed
May 5, 2026
What is the purpose of the change
This pull request stabilizes four flaky tests that have been observed failing intermittently on Flink CI. All four flakes have the same root cause pattern: assertions race against asynchronous bookkeeping or against tasks that have not yet accumulated enough wall-clock execution time to satisfy the assertion. The fixes make the tests deterministic without changing any production behavior.
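As a rough illustration of the two stabilization patterns described above (bounded polling instead of a one-shot assertion, and gating an asynchronous request on a future the test controls), here is a minimal standalone sketch. It is not code from this PR: the class name `DeterministicWaitSketch` and the helpers `waitUntil` and `inFlightWhileGated` are hypothetical, and the sketch uses only the JDK rather than Flink's test utilities.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in Flink.
class DeterministicWaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns false on timeout, so a real regression fails instead of hanging.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(5);
        }
        return true;
    }

    /**
     * Starts an async request that cannot finish until the gate is completed,
     * and returns the in-flight count observed while the gate is still closed.
     */
    static int inFlightWhileGated() throws Exception {
        AtomicInteger inFlight = new AtomicInteger();
        CompletableFuture<Void> gate = new CompletableFuture<>();

        inFlight.incrementAndGet();
        CompletableFuture<Void> request =
                CompletableFuture.runAsync(
                        () -> {
                            gate.join(); // blocks until the test opens the gate
                            inFlight.decrementAndGet();
                        });

        // The request cannot have completed yet, so this read is race-free.
        int observed = inFlight.get();

        gate.complete(null); // only now may the request finish
        request.get(5, TimeUnit.SECONDS);
        return observed;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger counter = new AtomicInteger();
        Thread background = new Thread(counter::incrementAndGet);
        background.start();

        // Bounded wait on asynchronous bookkeeping instead of asserting immediately.
        boolean updated = waitUntil(() -> counter.get() == 1, Duration.ofSeconds(5));
        System.out.println("background update observed: " + updated);
        System.out.println("in-flight while gated: " + inFlightWhileGated());
    }
}
```

In both helpers the test decides when to observe the state, so the result no longer depends on how quickly a background thread happens to run.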
Brief change log
SplitFetcherManagerTest#testCloseBlockingWaitingForFetcherShutdown: replace the strict equality check on the number of fetcher-manager threads (== 2) with a polling wait on the element-queue draining thread being started, and tolerate either WAITING or TIMED_WAITING states. This removes the timing assumption that the draining thread has spawned by the time the polling loop first runs.
RescaleTimelineITCase#testRecordNonTerminatedRescaleMergingWithNewRecoverableFailureTriggerCause: when rescale-history is enabled, poll the rescale history until the still-open UPDATE_REQUIREMENT rescale has been merged with the new RECOVERABLE_FAILOVER one before snapshotting the ExecutionGraphInfo. The merge is recorded asynchronously by the scheduler and is not synchronized with the parallelism / RUNNING signals the test waits on.
AbstractAsyncRunnableStreamOperatorTest#testCheckpointDrain: gate the supplier passed to asyncProcess on a CompletableFuture that is only completed after the in-flight-record assertion has run. Previously the request could complete on a fast machine before the getInFlightRecordNum() == 1 check, intermittently observing 0.
ExecutionTimeBasedSlowTaskDetectorTest#testMultipleJobVertexFinishedTaskExceedRatio: insert a short sleep between marking the baseline tasks FINISHED and invoking the detector, so that the still-running tasks have a strictly larger accumulated execution time than the baseline. On fast machines the whole sequence could complete within the same millisecond, leaving the running tasks with executionTime <= baseline and producing an empty slow-tasks map.
Verifying this change
This change is a test-only stabilization without any production-code changes.
It can be verified as follows:
For the RescaleTimelineITCase and SplitFetcherManagerTest cases, the polling waits are bounded by an explicit timeout, so a real regression in the merge / draining-thread behavior would still surface as a deterministic failure rather than a hang.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation
Was generative AI tooling used to co-author this PR?
Generated-by: [Github Copilot]