logictest: add retry to flaky regional_by_row_cascade EXPLAIN #164575

Open
mw5h wants to merge 1 commit into cockroachdb:master from mw5h:fix-regional-by-row-cascade-flake

@mw5h
Contributor

@mw5h mw5h commented Feb 28, 2026

Summary

  • Add retry directive to the full EXPLAIN (VERBOSE) query in regional_by_row_cascade to handle a rare stats propagation timing issue that causes intermittent test failures across master, release-26.1, and release-26.1.0-rc.

Fixes: #164001

Test plan

The test itself is the test plan: the retry directive allows the EXPLAIN query to be retried with backoff if the output doesn't match on the first attempt.
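
For readers unfamiliar with the pattern, here is a minimal Go sketch of retrying with backoff until the output matches. It is not the logictest runner's implementation: `retryUntilMatch`, its parameters, and the simulated query outputs are hypothetical and chosen only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntilMatch is a hypothetical helper, not code from the logictest
// runner. It calls run until its output equals want, sleeping between
// attempts and doubling the sleep each time. It returns the number of
// attempts made.
func retryUntilMatch(
	run func() string, want string, maxAttempts int, initialBackoff time.Duration,
) (int, error) {
	backoff := initialBackoff
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if run() == want {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxAttempts, errors.New("output never matched the expected value")
}

func main() {
	// Simulated query: the first two calls return a stale result, as if
	// fresh stats had not yet reached the optimizer. The strings are made up.
	calls := 0
	run := func() string {
		calls++
		if calls < 3 {
			return "missing stats"
		}
		return "estimated row count: 1"
	}

	attempts, err := retryUntilMatch(run, "estimated row count: 1", 5, time.Millisecond)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

A retry of this kind only helps when the mismatch is transient; if the plan differs for a reason that does not resolve with time, every attempt fails and the error is returned.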

The full EXPLAIN (VERBOSE) check in regional_by_row_cascade
intermittently fails because, despite clearing the stats cache
before the query, the optimizer occasionally does not see the
freshly collected stats in time. This causes the plan output
to differ (e.g. "missing stats" instead of actual row count
estimates), which cascades into a full EXPLAIN mismatch.

A retry was originally added in cockroachdb#156823 for a count-based query
in the same file, then removed in cockroachdb#157039 when the metamorphic
batch size issue was identified. The metamorphic fix did not
fully resolve the flake; it continues to appear across master,
release-26.1, and release-26.1.0-rc branches.

Add retry to the full EXPLAIN query to handle the rare stats
propagation timing window.

Fixes: cockroachdb#164001

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@trunk-io
Contributor

trunk-io bot commented Feb 28, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@cockroach-teamcity
Member

This change is Reviewable

@mw5h mw5h requested a review from yuzefovich February 28, 2026 01:31
@mw5h mw5h marked this pull request as ready for review February 28, 2026 01:31
@mw5h mw5h requested a review from michae2 February 28, 2026 01:31
@yuzefovich
Member

The difference in failures #164001 and #164399 is actually not the stats issue.

[Screenshot 2026-02-27 at 8:07:25 PM]

It seems as if we rarely choose to use the locking scan as the initial read in the DELETE. I don't see any interesting metamorphic variables. Also we do specify the column family definition, so it's not related to column family randomization. I wonder whether @michae2 has any hypothesis.

Perhaps a retry would fix the flake, but I wonder whether we should remove the whole output and just count the number of parallel readers (which is what this test is really about) with something like:

query I
SELECT count(*) FROM [EXPLAIN (VERBOSE) DELETE FROM great_grandparent WHERE i = 1] WHERE info LIKE '%parallel%';
----
4
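
A small Go sketch of why a count-based check tolerates plan differences that break a full-output comparison. The plan text below is invented for illustration and is not real EXPLAIN (VERBOSE) output; `countLinesContaining` is a hypothetical helper that mimics the `LIKE '%parallel%'` filter over the `info` rows.

```go
package main

import (
	"fmt"
	"strings"
)

// countLinesContaining is a hypothetical helper. It counts the lines of
// explainOutput that contain needle, which is what
// `SELECT count(*) ... WHERE info LIKE '%needle%'` does over EXPLAIN rows.
func countLinesContaining(explainOutput, needle string) int {
	n := 0
	for _, line := range strings.Split(explainOutput, "\n") {
		if strings.Contains(line, needle) {
			n++
		}
	}
	return n
}

func main() {
	// Invented plan fragments, not real EXPLAIN (VERBOSE) output. planB
	// has one extra line, standing in for an unrelated plan difference.
	planA := "delete\n  scan\n    parallel\n  scan\n    parallel"
	planB := "delete\n  scan\n    locking strength: for update\n    parallel\n  scan\n    parallel"

	// A full-output comparison fails, but the count is unchanged.
	fmt.Println(planA == planB) // prints "false"
	fmt.Println(countLinesContaining(planA, "parallel"), countLinesContaining(planB, "parallel")) // prints "2 2"
}
```

The trade-off is coverage: the count check would no longer catch an unintended plan change that leaves the number of parallel readers the same.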

@michae2
Collaborator

michae2 commented Feb 28, 2026

> I wonder whether @michae2 has any hypothesis.

We expect to use a LOS (locality-optimized search), but in rare cases we don't. I don't think the LOS depends on stats, but rather on regions. My guess is that when these failures happen, the gateway node is missing region(s) for some reason.

I'm hoping that #161999 and #164001 and company are the same as #161528 and have just been fixed by #163883.

Development

Successfully merging this pull request may close these issues.

pkg/ccl/logictestccl/tests/multiregion-9node-3region-3azs/multiregion-9node-3region-3azs_test: TestCCLLogic_regional_by_row_cascade failed