[hotfix] Fix flaky tests #28106

Open
Dennis-Mircea wants to merge 6 commits into apache:master from Dennis-Mircea:hotfix/fix-flaky-tests
@Dennis-Mircea

What is the purpose of the change

This pull request stabilizes four flaky tests that fail intermittently on Flink CI. All four share the same root-cause pattern: an assertion races either against asynchronous bookkeeping or against tasks that have not yet accumulated enough wall-clock execution time to satisfy it. The fixes make the tests deterministic without changing any production behavior.

Brief change log

  • SplitFetcherManagerTest#testCloseBlockingWaitingForFetcherShutdown: replace the strict equality check on the number of fetcher-manager threads (== 2) with a polling wait on the element-queue draining thread being started, and tolerate either WAITING or TIMED_WAITING states. This removes the timing assumption that the draining thread has spawned by the time the polling loop first runs.
  • RescaleTimelineITCase#testRecordNonTerminatedRescaleMergingWithNewRecoverableFailureTriggerCause: when rescale-history is enabled, poll the rescale history until the still-open UPDATE_REQUIREMENT rescale has been merged with the new RECOVERABLE_FAILOVER one before snapshotting the ExecutionGraphInfo. The merge is recorded asynchronously by the scheduler and is not synchronized with the parallelism / RUNNING signals the test waits on.
  • AbstractAsyncRunnableStreamOperatorTest#testCheckpointDrain: gate the supplier passed to asyncProcess on a CompletableFuture that is completed only after the in-flight-record assertion has run. Previously, on a fast machine the request could complete before the getInFlightRecordNum() == 1 check, so the check intermittently observed 0.
  • ExecutionTimeBasedSlowTaskDetectorTest#testMultipleJobVertexFinishedTaskExceedRatio: insert a short sleep between marking the baseline tasks FINISHED and invoking the detector, so that the still-running tasks have a strictly larger accumulated execution time than the baseline. On fast machines the whole sequence could complete within the same millisecond, leaving the running tasks with executionTime <= baseline and producing an empty slow-tasks map.
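
The gating idea from the testCheckpointDrain item can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: InFlightTracker, its asyncProcess signature, and runGatedScenario are hypothetical stand-ins that use only the JDK, and the real operator test harness is not reproduced here. The sketch only shows why blocking the supplier on a CompletableFuture makes the "one record in flight" observation deterministic.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class GatedAsyncRequestSketch {

    /** Hypothetical stand-in for an operator that counts in-flight async requests. */
    static final class InFlightTracker {
        private final AtomicInteger inFlight = new AtomicInteger();

        CompletableFuture<String> asyncProcess(Supplier<String> supplier, ExecutorService executor) {
            inFlight.incrementAndGet(); // counted before the request can run
            return CompletableFuture.supplyAsync(supplier, executor)
                    .whenComplete((result, error) -> inFlight.decrementAndGet());
        }

        int getInFlightRecordNum() {
            return inFlight.get();
        }
    }

    /** Returns {in-flight count while gated, in-flight count after release}. */
    static int[] runGatedScenario() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            InFlightTracker tracker = new InFlightTracker();
            CompletableFuture<Void> gate = new CompletableFuture<>();

            // The supplier blocks on the gate, so the request cannot finish early.
            CompletableFuture<String> request =
                    tracker.asyncProcess(
                            () -> {
                                gate.join();
                                return "done";
                            },
                            executor);

            // Safe to observe: the request is still blocked on the gate.
            int whileGated = tracker.getInFlightRecordNum();

            // Open the gate only after the observation has been taken.
            gate.complete(null);
            request.get(10, TimeUnit.SECONDS);
            int afterRelease = tracker.getInFlightRecordNum();
            return new int[] {whileGated, afterRelease};
        } catch (Exception e) {
            throw new IllegalStateException("Gated request did not complete", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] observed = runGatedScenario();
        System.out.println("in flight while gated: " + observed[0]);   // 1
        System.out.println("in flight after release: " + observed[1]); // 0
    }
}
```

Without the gate, the first observation in this sketch could be either 1 or 0 depending on how quickly the executor thread runs the supplier, which is the kind of race the change log describes.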

Verifying this change

This change is a test-only stabilization without any production-code changes.

It can be verified as follows:

  • Each of the four affected tests was executed in a tight loop locally (≥ 50 iterations per test, single-threaded and concurrent) and no failures were observed after the fixes; before the fixes the same loops reproduced the flakes documented in the linked CI runs.
  • For the RescaleTimelineITCase and SplitFetcherManagerTest cases the polling waits are bounded by an explicit timeout, so a real regression in the merge / draining-thread behavior would still surface as a deterministic failure rather than a hang.
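
A bounded polling wait of the kind described above might look roughly like the following sketch. The helper names (waitUntil, isParked) and the demo thread are assumptions made for illustration; the PR may well rely on existing Flink test utilities instead, which are not shown here. The point is that the wait fails with an assertion error once the timeout elapses, and that the thread-state check accepts both WAITING and TIMED_WAITING.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.function.BooleanSupplier;

public class BoundedPollingSketch {

    /** Polls the condition until it holds; fails instead of hanging once the timeout elapses. */
    static void waitUntil(BooleanSupplier condition, Duration timeout, String description) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new AssertionError("Timed out waiting for: " + description);
            }
            try {
                Thread.sleep(10); // short back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + description, e);
            }
        }
    }

    /** Accepts both blocked-wait states rather than insisting on exactly one of them. */
    static boolean isParked(Thread thread) {
        Thread.State state = thread.getState();
        return state == Thread.State.WAITING || state == Thread.State.TIMED_WAITING;
    }

    /** Starts a hypothetical worker thread and waits until it is observed blocking. */
    static boolean observeParkedWorker() {
        CountDownLatch release = new CountDownLatch(1);
        Thread worker =
                new Thread(
                        () -> {
                            try {
                                release.await(); // untimed wait: thread state becomes WAITING
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                            }
                        },
                        "hypothetical-draining-thread");
        worker.start();
        try {
            waitUntil(() -> isParked(worker), Duration.ofSeconds(10), "worker thread to block");
            return true;
        } finally {
            release.countDown(); // let the worker exit
        }
    }

    /** Shows that a condition that never holds yields a failure, not a hang. */
    static boolean timesOutWhenConditionNeverHolds() {
        try {
            waitUntil(() -> false, Duration.ofMillis(50), "a condition that never holds");
            return false;
        } catch (AssertionError expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("worker observed parked: " + observeParkedWorker());
        System.out.println("bounded wait timed out: " + timesOutWhenConditionNeverHolds());
    }
}
```

Comparing against a deadline derived from System.nanoTime() keeps the bound independent of wall-clock adjustments, which is one reasonable choice for a test helper of this sort.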

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: [GitHub Copilot]

…TerminatedRescaleMergingWithNewRecoverableFailureTriggerCause flaky test
…stMultipleJobVertexFinishedTaskExceedRatio flaky test
@flinkbot
Collaborator

flinkbot commented May 4, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@Dennis-Mircea Dennis-Mircea requested a review from davidradl May 6, 2026 08:56

3 participants